# Description

Notes on learning PySpark basics, worked through from a blog post.

Follow this blog post: https://towardsdatascience.com/the-hitchhikers-guide-to-handle-big-data-using-spark-90b9be0fe89a


* RDD Programming Guide http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Follow up

See Coursera courses
* Big Data Essentials: HDFS, MapReduce and Spark RDD
    * https://www.coursera.org/learn/big-data-analysis
* Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames
    * https://www.coursera.org/learn/big-data-analysis

# Imports

In [1]:
import pyspark

# Setup spark context

In [2]:
sc

''

In [3]:
sc = pyspark.SparkContext('local[*]')

In [4]:
sc

In [35]:
sc.environment

{'PYTHONHASHSEED': '0'}

In [38]:
sc.applicationId

'local-1590729753223'

In [39]:
sc.appName

'pyspark-shell'

In [40]:
sc.version

'2.4.5'

In [41]:
sc.pythonVer

'3.7'

In [43]:
sc.sparkHome

In [46]:
sc.startTime

1590729747186

# Read in Macbeth

In [5]:
ls

[1m[36mData-ML-100k--master[m[m/ Untitled.ipynb        requirements.txt
README.md             Untitled.py
[1m[36mShakespearePlaysPlus[m[m/ Untitled.txt


In [8]:
lines = sc.textFile('ShakespearePlaysPlus/tragedies/Macbeth.txt')

In [9]:
print(lines.count())

9879


In [18]:
# Split every line into words (dropping the NUL characters present in the file),
# map each word to a (word, 1) tuple,
# then reduce by key (i.e. the word) to sum the counts

counts = (lines.flatMap(lambda x: x.replace('\x00', '').split(' '))
           .map(lambda x: (x, 1))
           .reduceByKey(lambda x, y : x + y))

In [19]:
counts

PythonRDD[22] at RDD at PythonRDD.scala:53

In [22]:
# get the output on local
output = counts.take(10)

output

[('��<', 1),
 ('Shakespeare', 1),
 ('>', 6),
 ('', 5719),
 ('of', 319),
 ('Liberty', 1),
 ('(http://oll.libertyfund.org)', 1),
 ('Unicode', 1),
 ('version', 1),
 ('Scott', 1)]

In [23]:
# print output
for word, count in output:
  print(f'{word}: {count:d}')

��<: 1
Shakespeare: 1
>: 6
: 5719
of: 319
Liberty: 1
(http://oll.libertyfund.org): 1
Unicode: 1
version: 1
Scott: 1


# Map

In [24]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Square every term in my_list
squared_list = map(lambda x: x**2, my_list)
print(list(squared_list))

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


# filter

In [25]:
# keep only even numbers

filtered_list = filter(lambda x: x%2 == 0, my_list)
print(list(filtered_list))

[2, 4, 6, 8, 10]


# reduce

In [26]:
import functools
my_list = [1,2,3,4,5]

# sum all elements in my_list
sum_list = functools.reduce(lambda x,y: x + y, my_list)
sum_list

15

# RDD

## parallelize

In [28]:
data = [1,2,3,4,5,6,7,8,9,10]

new_rdd = sc.parallelize(data, 4)
new_rdd

ParallelCollectionRDD[25] at parallelize at PythonRDD.scala:195

# Two operations - Transformation and Action

* Transformation - creates a new dataset (RDD) from an existing RDD
* Action - the mechanism for getting results out of Spark
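
One practical consequence of this split, as described in the Spark RDD programming guide, is that transformations are lazy: Spark only records them, and nothing is computed until an action asks for a result. Python's built-in `map` behaves in a loosely similar way, so the plain-Python sketch below may help build intuition. It involves no Spark at all, and `square` and `calls` are illustrative names.

```python
# Plain-Python analogy for lazy transformations; this is not Spark code.
calls = []  # records every time the function actually runs


def square(x):
    calls.append(x)
    return x * x


lazy = map(square, [1, 2, 3])  # like a transformation: nothing runs yet
calls_before = list(calls)     # snapshot taken before anything is consumed

result = list(lazy)            # like an action: forces the computation
print(calls_before, result)
```

This is also why inspecting `counts` earlier printed only an RDD description, while `counts.take(10)` returned actual data.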

In [32]:
type(lines)

pyspark.rdd.RDD

In [33]:
type(counts)

pyspark.rdd.PipelinedRDD

In [34]:
type(output)

list

# Transformations

http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

## Map

In [58]:
data = [1,2,3,4,5,6,7,8,9,10]
rdd = sc.parallelize(data, 4)
squared_rdd = rdd.map(lambda x: x ** 2)

squared_rdd.collect()

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [59]:
squared_rdd

PythonRDD[33] at collect at <ipython-input-58-426c84408c9f>:5

## Filter

Return those elements that fulfill the condition

In [61]:
data = [1,2,3,4,5,6,7,8,9,10]
rdd = sc.parallelize(data, 4)
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
filtered_rdd.collect()

[2, 4, 6, 8, 10]

## Distinct

Returns only the distinct elements of an RDD.

In [66]:
data = [1,2,2,2,2,3,3,3,3,4,5,6,7,7,7,8,8,8,9,10]
rdd = sc.parallelize(data, 4)
distinct_rdd = rdd.distinct()
distinct_rdd.collect()

[4, 8, 1, 5, 9, 2, 6, 10, 3, 7]

## flatmap

Similar to map, but each input item can be mapped to 0 or more output items.



In [67]:
data = [1,2,3,4]
rdd = sc.parallelize(data, 4)
flat_rdd = rdd.flatMap(lambda x: [x, x**3])
flat_rdd.collect()

[1, 1, 2, 8, 3, 27, 4, 64]
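
To make the difference between map and flatMap concrete, here is a plain-Python sketch of the same computation with list comprehensions. It is only an analogy for the two RDD methods and runs without Spark.

```python
# Plain-Python analogy: map keeps one output per input, flatMap flattens.
data = [1, 2, 3, 4]

# map-like: each input becomes exactly one output (here, a list)
mapped = [[x, x ** 3] for x in data]

# flatMap-like: each input yields zero or more outputs, all flattened
flat = [y for x in data for y in [x, x ** 3]]

print(mapped)
print(flat)
```

The flattened result matches the output of `flat_rdd.collect()` above, while the map-like version keeps one nested list per input element.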

## Reduce By Key

This is the counterpart of the reduce step in Hadoop MapReduce.

Spark could not offer this if it only worked with plain lists of values.

Spark therefore has the concept of pair RDDs, whose elements are key-value tuples, which makes it a lot more flexible. Let's assume we have data with a product, its category, and its selling price. We can still parallelize the data.

In [68]:
data = [('Apple','Fruit',200),
        ('Banana','Fruit',24),
        ('Tomato','Fruit',56),
        ('Potato','Vegetable',103),
        ('Carrot','Vegetable',34)]

rdd = sc.parallelize(data, 4)

Right now our RDD rdd holds tuples.

Now we want to find out the total sum of revenue that we got from each category.

To do that we have to transform our rdd to a pair rdd so that it only contains key-value pairs/tuples.

In [69]:
category_price_rdd = rdd.map(lambda x: (x[1], x[2]))
category_price_rdd.collect()

[('Fruit', 200),
 ('Fruit', 24),
 ('Fruit', 56),
 ('Vegetable', 103),
 ('Vegetable', 34)]

Here we used the map function to get the data into the format we wanted. When working with a text file, the resulting RDD holds plain strings, and we use map to convert them into the format we want.

So now our category_price_rdd contains the product category and the price at which the product sold.

Now we want to reduce on the key category and sum the prices. We can do this by:

In [70]:
category_total_price_rdd = category_price_rdd.reduceByKey(lambda x, y: x + y)
category_total_price_rdd.collect()

[('Fruit', 280), ('Vegetable', 137)]

## Group By Key

Similar to reduceByKey, but it does not reduce the values; it just puts all the values for a key into an iterable. For example, if we wanted to keep the category as the key and all the products as the value, we would use this function.

Let us again use map to get data in the required form.

In [73]:
data = [('Apple','Fruit',200),
        ('Banana','Fruit',24),
        ('Tomato','Fruit',56),
        ('Potato','Vegetable',103),
        ('Carrot','Vegetable',34)]
rdd = sc.parallelize(data, 4)
category_product_rdd = rdd.map(lambda x: (x[1], x[0]))
category_product_rdd.collect()

[('Fruit', 'Apple'),
 ('Fruit', 'Banana'),
 ('Fruit', 'Tomato'),
 ('Vegetable', 'Potato'),
 ('Vegetable', 'Carrot')]

We then use groupByKey as:

In [74]:
grouped_products_by_category_rdd = category_product_rdd.groupByKey()
findata = grouped_products_by_category_rdd.collect()
# each value is a ResultIterable, so convert it to a list before printing
for category, products in findata:
    print(category, list(products))

Fruit ['Apple', 'Banana', 'Tomato']
Vegetable ['Potato', 'Carrot']


In [75]:
findata

[('Fruit', <pyspark.resultiterable.ResultIterable at 0x1210206a0>),
 ('Vegetable', <pyspark.resultiterable.ResultIterable at 0x121020710>)]

Here the groupByKey function worked and it returned the category and the list of products in that category.
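
As a rough mental model of how the two operations relate, the plain-Python sketch below groups values per key and then optionally folds them. The helpers `group_by_key` and `reduce_by_key` are hypothetical names written for this illustration; they are not Spark functions and they ignore partitioning entirely.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Hypothetical helper: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(func, pairs):
    """Hypothetical helper: fold the values of each key into one value."""
    return {key: reduce(func, values)
            for key, values in group_by_key(pairs).items()}


category_product = [('Fruit', 'Apple'), ('Fruit', 'Banana'), ('Fruit', 'Tomato'),
                    ('Vegetable', 'Potato'), ('Vegetable', 'Carrot')]
category_price = [('Fruit', 200), ('Fruit', 24), ('Fruit', 56),
                  ('Vegetable', 103), ('Vegetable', 34)]

grouped = group_by_key(category_product)
totals = reduce_by_key(lambda x, y: x + y, category_price)
print(grouped)
print(totals)
```

In Spark itself, the RDD programming guide suggests preferring reduceByKey over groupByKey when the goal is an aggregation such as a sum, since values can then be combined before they are moved between workers.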

# Action Basics

http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

You have filtered your data, mapped some functions over it, and finished your computation.

Now you want to get the data onto your local machine, save it to a file, or show the results as graphs in Excel or any visualization tool.

You will need actions for that. A comprehensive list of actions is provided here: http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Some of the most common actions that I tend to use are:


## collect
We have already used this action many times. It takes the whole RDD and brings it back to the driver program.

## reduce

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

In [79]:
rdd = sc.parallelize([1,2,3,4,5])
rdd.reduce(lambda x,y : x+y)

15
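
To see why the function should be commutative and associative, the sketch below models a distributed reduce in plain Python: each partition is reduced on its own, and the partial results are then reduced again. `partitioned_reduce` is a hypothetical helper and a simplified model, not Spark's actual implementation.

```python
from functools import reduce


def partitioned_reduce(func, partitions):
    """Hypothetical model: reduce each partition, then reduce the partials."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


# Two different ways the same data [1, 2, 3, 4, 5] might be partitioned
split_a = [[1, 2], [3, 4, 5]]
split_b = [[1, 2, 3], [4, 5]]

add_a = partitioned_reduce(lambda x, y: x + y, split_a)
add_b = partitioned_reduce(lambda x, y: x + y, split_b)

sub_a = partitioned_reduce(lambda x, y: x - y, split_a)
sub_b = partitioned_reduce(lambda x, y: x - y, split_b)

print(add_a, add_b)  # addition does not depend on the split
print(sub_a, sub_b)  # subtraction does
```

Addition gives the same total for both splits, whereas subtraction, which is neither commutative nor associative, gives a result that depends on how the data happened to be partitioned.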

## take
Sometimes you will need to see what your RDD contains without getting all the elements in memory itself. take returns a list with the first n elements of the RDD.

In [80]:
rdd = sc.parallelize([1,2,3,4,5])
rdd.take(3)

[1, 2, 3]

## takeOrdered
takeOrdered returns the first n elements of the RDD using either their natural order or a custom key function. Negating the key, as below, gives descending order.

In [82]:
rdd = sc.parallelize([5,3,12,23])
# descending order
rdd.takeOrdered(3, lambda s: -1*s)


[23, 12, 5]

In [83]:
rdd = sc.parallelize([(5,23),(3,34),(12,344),(23,29)])
# descending order
rdd.takeOrdered(3, lambda s: -1*s[1])

[(12, 344), (3, 34), (23, 29)]
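
The negated-key trick is not specific to Spark. As a plain-Python analogy, the standard library's `heapq.nsmallest` accepts the same kind of key function, and the sketch below reproduces the second example with it.

```python
import heapq

pairs = [(5, 23), (3, 34), (12, 344), (23, 29)]

# The 3 "smallest" elements by negated second field,
# i.e. the 3 largest by second field.
top3 = heapq.nsmallest(3, pairs, key=lambda s: -s[1])

# Equivalent formulation with a full sort, for comparison
top3_sorted = sorted(pairs, key=lambda s: s[1], reverse=True)[:3]

print(top3)
```

Both formulations return the same three pairs as the `takeOrdered` call above.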

# Understanding The WordCount Example

Now we sort of understand the transformations and the actions provided to us by Spark.

It should not be difficult to understand the wordcount program now. Let us go through the program line by line.

The first line creates an RDD and distributes it to the workers.

In [84]:
lines = sc.textFile('ShakespearePlaysPlus/tragedies/Macbeth.txt')

The RDD `lines` contains the lines of the file, one string per line. You can inspect its content using `take`:

In [85]:
lines.take(5)

['��<\x00 \x00S\x00h\x00a\x00k\x00e\x00s\x00p\x00e\x00a\x00r\x00e\x00 \x00-\x00-\x00 \x00M\x00A\x00C\x00B\x00E\x00T\x00H\x00 \x00>\x00',
 '\x00',
 '\x00<\x00 \x00f\x00r\x00o\x00m\x00 \x00O\x00n\x00l\x00i\x00n\x00e\x00 \x00L\x00i\x00b\x00r\x00a\x00r\x00y\x00 \x00o\x00f\x00 \x00L\x00i\x00b\x00e\x00r\x00t\x00y\x00 \x00(\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00o\x00l\x00l\x00.\x00l\x00i\x00b\x00e\x00r\x00t\x00y\x00f\x00u\x00n\x00d\x00.\x00o\x00r\x00g\x00)\x00 \x00>\x00',
 '\x00',
 '\x00<\x00 \x00U\x00n\x00i\x00c\x00o\x00d\x00e\x00 \x00.\x00t\x00x\x00t\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00 \x00b\x00y\x00 \x00M\x00i\x00k\x00e\x00 \x00S\x00c\x00o\x00t\x00t\x00 \x00(\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00w\x00w\x00w\x00.\x00l\x00e\x00x\x00i\x00c\x00a\x00l\x00l\x00y\x00.\x00n\x00e\x00t\x00)\x00 \x00>\x00']
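
The `\x00` characters between letters and the two `�` characters at the start suggest that the file is stored as UTF-16 with a byte-order mark but is being decoded as UTF-8. This is an inference from the output, not something the source states. The plain-Python sketch below reproduces the same pattern under that assumption, using a short sample string instead of the real file.

```python
import codecs

# Sample text standing in for the first line of the file (assumption)
text = "< Shakespeare -- MACBETH >"

# Encode as UTF-16 little-endian with a byte-order mark
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoding those bytes as UTF-8 leaves NULs between the characters and
# turns the two BOM bytes into replacement characters
misread = raw.decode("utf-8", errors="replace")

# The workaround used in the wordcount code: drop the NUL characters
cleaned = misread.replace("\x00", "")

# Decoding with the matching codec recovers the text directly
decoded = raw.decode("utf-16")

print(repr(misread[:12]))
print(repr(cleaned))
print(repr(decoded))
```

If the assumption holds, this also explains why the first word in the counts still carries two stray `�` characters: `replace('\x00', '')` removes the NULs but not the misread byte-order mark.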

This next line is actually the workhorse function in the whole script.

In [86]:
counts = (lines.flatMap(lambda x: x.replace('\x00', '').split(' '))
                  .map(lambda x: (x, 1))
                  .reduceByKey(lambda x, y : x + y))

It contains a series of transformations that we apply to the lines RDD. First of all, we do a flatMap transformation.

The flatMap transformation takes the lines as input and gives words as output. So after the flatMap transformation, the RDD is of the form:

`['word1','word2','word3','word4','word3','word2']`


Next, we do a map transformation on the flatmap output which converts the RDD to :

`[('word1',1),('word2',1),('word3',1),('word4',1),('word3',1),('word2',1)]`


Finally, we do a reduceByKey transformation, which counts the number of times each word appeared.

After this, the RDD has the final desired form:

`[('word1',1),('word2',2),('word3',2),('word4',1)]`
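
The three steps can be traced in plain Python on a toy input that yields exactly the word lists shown above. This sketch is only an analogy and uses no Spark; the two input lines are made up for the illustration.

```python
from collections import defaultdict
from itertools import chain

# Toy input standing in for the lines RDD (made-up data)
toy_lines = ["word1 word2 word3", "word4 word3 word2"]

# flatMap-like step: split every line and flatten into one list of words
words = list(chain.from_iterable(line.split(" ") for line in toy_lines))

# map-like step: pair every word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey-like step: sum the counts per word
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
word_counts = sorted(totals.items())

print(words)
print(pairs)
print(word_counts)
```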


This next line is an action that takes the first 10 elements of the resulting RDD locally.

In [87]:
output = counts.take(10)

This line just prints the output

In [88]:
# print output
for word, count in output:
  print(f'{word}: {count:d}')

��<: 1
Shakespeare: 1
>: 6
: 5719
of: 319
Liberty: 1
(http://oll.libertyfund.org): 1
Unicode: 1
version: 1
Scott: 1


And that is it for the wordcount program. Hope you understand it now.

So far, we have talked about the wordcount example and the basic transformations and actions that you can use in Spark. But we don’t do wordcount in real life.

We have to work on bigger problems that are much more complex. Worry not! Whatever we have learned so far will let us do that and more.

# Spark in Action with Example

Let us work with a concrete example which takes care of some usual transformations.

We will work on the MovieLens ml-100k.zip dataset, a stable benchmark dataset released in 4/1998. It holds 100,000 ratings from 943 users on 1,682 movies (often rounded to 1000 users and 1700 movies).

The MovieLens dataset contains a lot of files, but we are going to work with only 3 of them:

1) Users: the file is named “u.user”. Its columns are:

    ['user_id', 'age', 'sex', 'occupation', 'zip_code']

2) Ratings: the file is named “u.data”. Its columns are:

    ['user_id', 'movie_id', 'rating', 'unix_timestamp']

3) Movies: the file is named “u.item”. Its columns are:

    ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url', and 19 more columns.....]

The blog post starts by importing these 3 files into its Spark instance using ‘Import and Explore Data’ on the home tab.

In this notebook, we instead read the files u.user, u.data, and u.item from the local `Data-ML-100k--master/ml-100k` directory.

## Find 25 most rated movies

Our business partner now comes to us and asks us to find the 25 most rated movie titles in this data, that is, how many times each movie has been rated.

Let us load the data in different RDDs and see what the data contains.

In [89]:
ls

[1m[36mData-ML-100k--master[m[m/   Untitled.py             pyspark-tutorial.py
README.md               Untitled.txt            requirements.txt
[1m[36mShakespearePlaysPlus[m[m/   pyspark-tutorial.ipynb


In [90]:
userRDD = sc.textFile('Data-ML-100k--master/ml-100k/u.user') 
ratingRDD = sc.textFile('Data-ML-100k--master/ml-100k/u.data') 
movieRDD = sc.textFile('Data-ML-100k--master/ml-100k/u.item') 

print("userRDD:", userRDD.take(1))
print("ratingRDD:", ratingRDD.take(1))
print("movieRDD:", movieRDD.take(1))

userRDD: ['1|24|M|technician|85711']
ratingRDD: ['196\t242\t3\t881250949']
movieRDD: ['1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0']


We note that to answer this question we will need to use the ratingRDD. But the ratingRDD does not have the movie name.

So we would have to merge movieRDD and ratingRDD using movie_id.


In [92]:
# Create an RDD from ratingRDD that contains only the two columns of interest, i.e. movie_id and rating.

RDD_movie_rating = ratingRDD.map(lambda x: (x.split('\t')[1], x.split('\t')[2]))
print(f'RDD_movie_rating: {RDD_movie_rating.take(4)}')

RDD_movie_rating: [('242', '3'), ('302', '3'), ('377', '1'), ('51', '2')]


In [96]:
# Create an RDD from movieRDD that contains only the two columns of interest, i.e. movie_id and title.
RDD_movie_title = movieRDD.map(lambda x: (x.split('|')[0], x.split('|')[1]))
print(f'RDD_movie_title: {RDD_movie_title.take(2)}')

RDD_movie_title: [('1', 'Toy Story (1995)'), ('2', 'GoldenEye (1995)')]


In [98]:
# merge these two pair RDDs based on movie_id. For this we will use the transformation leftOuterJoin(). See the transformation document.
rdd_movie_title_rating = RDD_movie_rating.leftOuterJoin(RDD_movie_title)
print(f'rdd_movie_title_rating: {rdd_movie_title_rating.take(1)}')

rdd_movie_title_rating: [('346', ('1', 'Jackie Brown (1997)'))]
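
The shape of the joined records, `(key, (left_value, right_value))`, explains the `x[1][1]` indexing used in the next step. The plain-Python sketch below mimics a left outer join on key-value pairs. `left_outer_join` is a hypothetical helper written for this illustration, and the movie_id `'9999'` is a made-up key with no matching title.

```python
def left_outer_join(left, right):
    """Hypothetical helper: pair every left value with each matching right
    value, or with None when the key is missing on the right."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)

    joined = []
    for key, value in left:
        for match in lookup.get(key, [None]):
            joined.append((key, (value, match)))
    return joined


# Toy (movie_id, rating) and (movie_id, title) pairs
movie_rating = [('346', '1'), ('1', '5'), ('9999', '4')]
movie_title = [('1', 'Toy Story (1995)'), ('346', 'Jackie Brown (1997)')]

joined = left_outer_join(movie_rating, movie_title)
for record in joined:
    print(record)
```

Every rating is kept. A rating whose movie_id has no title ends up paired with `None`, which is the behaviour that distinguishes a left outer join from an inner join.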


In [99]:
# use the RDD in previous step to create (movie,1) tuple pair RDD
rdd_title_rating = rdd_movie_title_rating.map(lambda x: (x[1][1], 1))
print(f'rdd_title_rating: {rdd_title_rating.take(1)}')

rdd_title_rating: [('Jackie Brown (1997)', 1)]


In [100]:
# Use the reduceByKey transformation to reduce on the basis of movie_title
rdd_title_rating_cnt = rdd_title_rating.reduceByKey(lambda x, y: x + y)
print(f'rdd_title_rating_cnt: {rdd_title_rating_cnt.take(2)}')

rdd_title_rating_cnt: [('Jackie Brown (1997)', 126), ('Jungle Book, The (1994)', 85)]


In [105]:
# Get the final answer with the takeOrdered action
print('#' * 25)
print('25 most rated movies')
for title, count in rdd_title_rating_cnt.takeOrdered(25, lambda x: -x[1]):
    print(count, title)
print('#' * 25)

#########################
25 most rated movies
583 Star Wars (1977)
509 Contact (1997)
508 Fargo (1996)
507 Return of the Jedi (1983)
485 Liar Liar (1997)
481 English Patient, The (1996)
478 Scream (1996)
452 Toy Story (1995)
431 Air Force One (1997)
429 Independence Day (ID4) (1996)
420 Raiders of the Lost Ark (1981)
413 Godfather, The (1972)
394 Pulp Fiction (1994)
392 Twelve Monkeys (1995)
390 Silence of the Lambs, The (1991)
384 Jerry Maguire (1996)
379 Chasing Amy (1997)
378 Rock, The (1996)
367 Empire Strikes Back, The (1980)
365 Star Trek: First Contact (1996)
350 Back to the Future (1985)
350 Titanic (1997)
344 Mission: Impossible (1996)
336 Fugitive, The (1993)
331 Indiana Jones and the Last Crusade (1989)
#########################


Star Wars is the most rated movie in the MovieLens dataset.

We could have done all of this in a single chained command, shown below, although the code gets a little messy.

I include it to show that you can chain functions in Spark and skip the intermediate variables.


### Rewrite as chained function

In [113]:
print(ratingRDD.map(lambda x: (x.split("\t")[1], x.split("\t")[2]))
      .leftOuterJoin(movieRDD.map(lambda x: (x.split("|")[0], x.split("|")[1])))
      .map(lambda x: (x[1][1], 1))
      .reduceByKey(lambda x, y: x + y)
      .takeOrdered(25, lambda x: -x[1]))

[('Star Wars (1977)', 583), ('Contact (1997)', 509), ('Fargo (1996)', 508), ('Return of the Jedi (1983)', 507), ('Liar Liar (1997)', 485), ('English Patient, The (1996)', 481), ('Scream (1996)', 478), ('Toy Story (1995)', 452), ('Air Force One (1997)', 431), ('Independence Day (ID4) (1996)', 429), ('Raiders of the Lost Ark (1981)', 420), ('Godfather, The (1972)', 413), ('Pulp Fiction (1994)', 394), ('Twelve Monkeys (1995)', 392), ('Silence of the Lambs, The (1991)', 390), ('Jerry Maguire (1996)', 384), ('Chasing Amy (1997)', 379), ('Rock, The (1996)', 378), ('Empire Strikes Back, The (1980)', 367), ('Star Trek: First Contact (1996)', 365), ('Back to the Future (1985)', 350), ('Titanic (1997)', 350), ('Mission: Impossible (1996)', 344), ('Fugitive, The (1993)', 336), ('Indiana Jones and the Last Crusade (1989)', 331)]


## Find 25 most highly rated movies

Let us do one more. For practice:

Now we want to find the most highly rated 25 movies using the same dataset. We actually want only those movies which have been rated at least 100 times.


In [118]:
rdd_movie_title_rating.first()

('346', ('1', 'Jackie Brown (1997)'))

In [125]:
# We create an RDD that contains sum of all the ratings for a particular movie
rdd_title_ratings_sum = (
    rdd_movie_title_rating
    .map(lambda x: (x[1][1], int(x[1][0])))
    .reduceByKey(lambda x, y: x + y))
print(f'rdd_title_ratings_sum: {rdd_title_ratings_sum.take(2)}')

rdd_title_ratings_sum: [('Jackie Brown (1997)', 459), ('Jungle Book, The (1994)', 303)]


In [134]:
# Join this data with rdd_title_rating_cnt, the RDD we created in the previous section,
# and use map to divide the rating sum by the rating count.

rdd_title_ratings_mean_rating_count = (
    rdd_title_ratings_sum
    .leftOuterJoin(rdd_title_rating_cnt)
    .map(lambda x: (x[0], x[1][0] / x[1][1], x[1][1]))
)

print(f'rdd_title_ratings_mean_rating_count: {rdd_title_ratings_mean_rating_count.take(1)}')

rdd_title_ratings_mean_rating_count: [('Jackie Brown (1997)', 3.642857142857143, 126)]
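
One possible alternative to joining a sum RDD with a count RDD is to carry both numbers together as a `(sum, count)` pair per title and divide at the end. The plain-Python sketch below shows that idea; `mean_by_key` is a hypothetical helper, and the toy ratings are invented for the illustration rather than taken from the dataset.

```python
def mean_by_key(pairs):
    """Hypothetical helper: accumulate (sum, count) per key in one pass,
    then return {key: (mean, count)}."""
    acc = {}
    for key, rating in pairs:
        total, count = acc.get(key, (0, 0))
        acc[key] = (total + rating, count + 1)
    return {key: (total / count, count)
            for key, (total, count) in acc.items()}


# Invented (title, rating) pairs
toy_ratings = [('Jackie Brown (1997)', 4),
               ('Jackie Brown (1997)', 3),
               ('Toy Story (1995)', 5)]

means = mean_by_key(toy_ratings)
print(means)
```

In Spark terms this would roughly correspond to mapping each rating to `(title, (rating, 1))` and summing both fields in a single reduceByKey, which avoids the extra join. That is a design suggestion, not what the blog post does.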


In [137]:
# We could call takeOrdered right away, but we only want movies with at least
# 100 ratings, so let's filter the RDD first.
rdd_title_rating_rating_ct_gt_100 = (rdd_title_ratings_mean_rating_count
    .filter(lambda x: x[2] >= 100))

print(f'rdd_title_rating_rating_ct_gt_100: {rdd_title_rating_rating_ct_gt_100.take(1)}')

rdd_title_rating_rating_ct_gt_100: [('Jackie Brown (1997)', 3.642857142857143, 126)]


In [138]:
# Get the final answer with the takeOrdered action
print('#' * 25)
print('25 highly rated movies')
for title, mean_rating, count in rdd_title_rating_rating_ct_gt_100.takeOrdered(25, lambda x: -x[1]):
    print(mean_rating, title)
print('#' * 25)

#########################
25 highly rated movies
4.491071428571429 Close Shave, A (1995)
4.466442953020135 Schindler's List (1993)
4.466101694915254 Wrong Trousers, The (1993)
4.45679012345679 Casablanca (1942)
4.445229681978798 Shawshank Redemption, The (1994)
4.3875598086124405 Rear Window (1954)
4.385767790262173 Usual Suspects, The (1995)
4.3584905660377355 Star Wars (1977)
4.344 12 Angry Men (1957)
4.292929292929293 Citizen Kane (1941)
4.292237442922374 To Kill a Mockingbird (1962)
4.291666666666667 One Flew Over the Cuckoo's Nest (1975)
4.28974358974359 Silence of the Lambs, The (1991)
4.284916201117318 North by Northwest (1959)
4.283292978208232 Godfather, The (1972)
4.265432098765432 Secrets & Lies (1996)
4.262626262626263 Good Will Hunting (1997)
4.259541984732825 Manchurian Candidate, The (1962)
4.252577319587629 Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)
4.252380952380952 Raiders of the Lost Ark (1981)
4.251396648044692 Vertigo (1958)
4.24571

We have talked about RDDs so far because they are very powerful.

You can use RDDs to work with non-relational databases too.

They let you do a lot of things that you couldn’t do with Spark SQL.

Yes, you can use SQL with Spark too, which is what I am going to talk about now.

# Spark DataFrames

https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html#

Spark provides the DataFrame API for us data scientists to work with relational data. The documentation linked above is there for the adventurous folks.

Remember that in the background it is still all RDDs, which is why the first part of this post focussed on RDDs.

I will start with some common functionality you will need to work with Spark DataFrames. It will look a lot like pandas, with some syntax changes.


## Setup `SparkSession`

In [143]:
from pyspark.sql import SparkSession

In [144]:
spark = (SparkSession
        .builder
        .appName('Python Spark SQL basic example')
        .config('spark.some.config.option', 'some-value')
        .getOrCreate())

In [145]:
spark

## Reading the file

In [147]:
ratings = spark.read.load('Data-ML-100k--master/ml-100k/u.data', 
                          format='csv',
                          sep='\t',
                          inferSchema='true',
                          header='false')

## Show File

In [148]:
ratings.show()

+---+----+---+---------+
|_c0| _c1|_c2|      _c3|
+---+----+---+---------+
|196| 242|  3|881250949|
|186| 302|  3|891717742|
| 22| 377|  1|878887116|
|244|  51|  2|880606923|
|166| 346|  1|886397596|
|298| 474|  4|884182806|
|115| 265|  2|881171488|
|253| 465|  5|891628467|
|305| 451|  3|886324817|
|  6|  86|  3|883603013|
| 62| 257|  2|879372434|
|286|1014|  5|879781125|
|200| 222|  5|876042340|
|210|  40|  3|891035994|
|224|  29|  3|888104457|
|303| 785|  3|879485318|
|122| 387|  5|879270459|
|194| 274|  2|879539794|
|291|1042|  4|874834944|
|234|1184|  2|892079237|
+---+----+---+---------+
only showing top 20 rows



In [149]:
display(ratings)

DataFrame[_c0: int, _c1: int, _c2: int, _c3: int]

## Change column names

In [150]:
# IPython help syntax: show the source and docstring of toDF
ratings.toDF??

In [151]:
ratings = ratings.toDF(*['user_id', 'movie_id', 'rating', 'unix_timestamp'])
display(ratings)

DataFrame[user_id: int, movie_id: int, rating: int, unix_timestamp: int]

In [152]:
ratings.show()

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    196|     242|     3|     881250949|
|    186|     302|     3|     891717742|
|     22|     377|     1|     878887116|
|    244|      51|     2|     880606923|
|    166|     346|     1|     886397596|
|    298|     474|     4|     884182806|
|    115|     265|     2|     881171488|
|    253|     465|     5|     891628467|
|    305|     451|     3|     886324817|
|      6|      86|     3|     883603013|
|     62|     257|     2|     879372434|
|    286|    1014|     5|     879781125|
|    200|     222|     5|     876042340|
|    210|      40|     3|     891035994|
|    224|      29|     3|     888104457|
|    303|     785|     3|     879485318|
|    122|     387|     5|     879270459|
|    194|     274|     2|     879539794|
|    291|    1042|     4|     874834944|
|    234|    1184|     2|     892079237|
+-------+--------+------+--------------+
only showing top 20 rows

## Some basic stats

In [154]:
print(f'row count:    {ratings.count()}') 
print(f'column count: {len(ratings.columns)}')

row count:    100000
column count: 4


In [156]:
ratings.describe().show()

+-------+------------------+------------------+------------------+-----------------+
|summary|           user_id|          movie_id|            rating|   unix_timestamp|
+-------+------------------+------------------+------------------+-----------------+
|  count|            100000|            100000|            100000|           100000|
|   mean|         462.48475|         425.53013|           3.52986|8.8352885148862E8|
| stddev|266.61442012750905|330.79835632558473|1.1256735991443214|5343856.189502848|
|    min|                 1|                 1|                 1|        874724710|
|    max|               943|              1682|                 5|        893286638|
+-------+------------------+------------------+------------------+-----------------+



In [157]:
ratings.describe().toPandas()

Unnamed: 0,summary,user_id,movie_id,rating,unix_timestamp
0,count,100000.0,100000.0,100000.0,100000.0
1,mean,462.48475,425.53013,3.52986,883528851.48862
2,stddev,266.61442012750905,330.79835632558473,1.1256735991443214,5343856.189502848
3,min,1.0,1.0,1.0,874724710.0
4,max,943.0,1682.0,5.0,893286638.0


## Select a few columns

In [158]:
ratings.select('user_id', 'movie_id').show()

+-------+--------+
|user_id|movie_id|
+-------+--------+
|    196|     242|
|    186|     302|
|     22|     377|
|    244|      51|
|    166|     346|
|    298|     474|
|    115|     265|
|    253|     465|
|    305|     451|
|      6|      86|
|     62|     257|
|    286|    1014|
|    200|     222|
|    210|      40|
|    224|      29|
|    303|     785|
|    122|     387|
|    194|     274|
|    291|    1042|
|    234|    1184|
+-------+--------+
only showing top 20 rows



In [160]:
ratings.filter((ratings.rating==5) & (ratings.user_id==253)).show()

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    253|     465|     5|     891628467|
|    253|     510|     5|     891628416|
|    253|     183|     5|     891628341|
|    253|     483|     5|     891628122|
|    253|     198|     5|     891628392|
|    253|     127|     5|     891628060|
|    253|     173|     5|     891628483|
|    253|     527|     5|     891628518|
|    253|     117|     5|     891628535|
|    253|      87|     5|     891628278|
|    253|     705|     5|     891628598|
|    253|      64|     5|     891628252|
|    253|     496|     5|     891628278|
|    253|      79|     5|     891628518|
|    253|      98|     5|     891628295|
|    253|     588|     5|     891628416|
|    253|      22|     5|     891628435|
|    253|     494|     5|     891628341|
|    253|      12|     5|     891628159|
|    253|     202|     5|     891628392|
+-------+--------+------+--------------+
only showing top 20 rows
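
The parentheses around each condition in the filter matter, because Python's `&` operator binds more tightly than `==`. The plain-Python sketch below shows the effect with two integers standing in for the column values of a single row. It uses no Spark, and the variable names are illustrative.

```python
# Values standing in for one row's rating and user_id
rating, user_id = 5, 253

# Without parentheses this parses as: rating == (5 & user_id) == 253,
# a chained comparison that is False here
without_parens = rating == 5 & user_id == 253

# With parentheses each comparison is evaluated first, then combined
with_parens = (rating == 5) & (user_id == 253)

print(without_parens, with_parens)
```

With Spark columns the unparenthesised form does not even produce a wrong answer quietly; it typically fails with an error, so wrapping each condition in parentheses is the safe habit.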

## Groupby

We can use the groupBy function with a Spark DataFrame too. It works pretty much the same as a pandas groupby, except that you will need to import `pyspark.sql.functions` for the aggregations.


Here we find the count of ratings and the average rating for each user_id.

In [165]:
from pyspark.sql import functions as F

ratings.groupBy('user_id').agg(F.count('user_id'), F.mean('rating')).show()

+-------+--------------+------------------+
|user_id|count(user_id)|       avg(rating)|
+-------+--------------+------------------+
|    148|            65|               4.0|
|    463|           133|2.8646616541353382|
|    471|            31|3.3870967741935485|
|    496|           129|3.0310077519379846|
|    833|           267| 3.056179775280899|
|    243|            81|3.6419753086419755|
|    392|           111| 4.045045045045045|
|    540|            63|3.7142857142857144|
|    623|            45|3.7333333333333334|
|    737|            33|3.9696969696969697|
|    858|            21|3.4285714285714284|
|    897|           185| 3.962162162162162|
|     31|            36|3.9166666666666665|
|    516|            21| 4.095238095238095|
|    251|            77| 3.792207792207792|
|     85|           288|3.5381944444444446|
|    137|            47| 4.319148936170213|
|    451|            98|2.7346938775510203|
|    580|            47|3.5531914893617023|
|    808|            23| 4.13043

## sort

In [166]:
ratings.sort('user_id').show()

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|      1|     113|     5|     878542738|
|      1|     227|     4|     876892946|
|      1|      17|     3|     875073198|
|      1|      61|     4|     878542420|
|      1|      90|     4|     878542300|
|      1|      33|     4|     878542699|
|      1|      64|     5|     875072404|
|      1|     202|     5|     875072442|
|      1|      92|     3|     876892425|
|      1|     265|     4|     878542441|
|      1|     228|     5|     878543541|
|      1|     155|     2|     878542201|
|      1|     266|     1|     885345728|
|      1|      47|     4|     875072125|
|      1|     121|     4|     875071823|
|      1|      20|     4|     887431883|
|      1|     114|     5|     875072173|
|      1|     189|     3|     888732928|
|      1|     132|     4|     878542889|
|      1|     171|     5|     889751711|
+-------+--------+------+--------------+
only showing top 20 rows

## Descending sort

In [171]:
from pyspark.sql import functions as F

ratings.sort(F.desc('user_id'), F.desc('rating'), 'movie_id').show()

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    943|       2|     5|     888639953|
|    943|      12|     5|     888639093|
|    943|      42|     5|     888639042|
|    943|      55|     5|     888639118|
|    943|      56|     5|     888639269|
|    943|      64|     5|     875409939|
|    943|      69|     5|     888639427|
|    943|      79|     5|     888639019|
|    943|      92|     5|     888639660|
|    943|      98|     5|     888638980|
|    943|     100|     5|     875501725|
|    943|     127|     5|     875501774|
|    943|     173|     5|     888638960|
|    943|     182|     5|     888639066|
|    943|     184|     5|     888639247|
|    943|     186|     5|     888639478|
|    943|     187|     5|     888639147|
|    943|     194|     5|     888639192|
|    943|     196|     5|     888639192|
|    943|     201|     5|     888639351|
+-------+--------+------+--------------+
only showing top 20 rows

## Joins/Merging with Spark Dataframes

Spark DataFrames have a `join` method that plays the role of pandas `merge` (for example `df1.join(df2, on='key', how='inner')`), but we can also use SQL with DataFrames and thus merge them using SQL.

Let us try to run some SQL on ratings.

We first register the ratings DataFrame as a temporary table ratings_table on which we can run SQL operations.

As you will see, the result of the SQL select statement is again a Spark DataFrame.

## Define `sqlContext`

In [174]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In [175]:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
ratings.createOrReplaceTempView('ratings_table')
newDf = sqlContext.sql('select * from ratings_table where rating > 4')
newDf.show()

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    253|     465|     5|     891628467|
|    286|    1014|     5|     879781125|
|    200|     222|     5|     876042340|
|    122|     387|     5|     879270459|
|     38|      95|     5|     892430094|
|    160|     234|     5|     876861185|
|    278|     603|     5|     891295330|
|    287|     327|     5|     875333916|
|    246|     201|     5|     884921594|
|    242|    1137|     5|     879741196|
|    249|     241|     5|     879641194|
|     99|       4|     5|     886519097|
|     25|     181|     5|     885853415|
|     59|     196|     5|     888205088|
|    290|     143|     5|     880474293|
|     42|     423|     5|     881107687|
|    138|      26|     5|     879024232|
|     60|     427|     5|     883326620|
|     57|     304|     5|     883698581|
|    127|     229|     5|     884364867|
+-------+--------+------+--------------+
only showing top

In [176]:
newDf.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- movie_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- unix_timestamp: integer (nullable = true)



Let us now add one more Spark DataFrame to the mix to see whether we can join the two using SQL queries:


In [177]:
# get one more df to join
movies = spark.read.load('Data-ML-100k--master/ml-100k/u.item', 
                        format='csv', sep='|', inferSchema='true', header='false')
movies.show()

+---+--------------------+-----------+----+--------------------+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c0|                 _c1|        _c2| _c3|                 _c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|
+---+--------------------+-----------+----+--------------------+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|  1|    Toy Story (1995)|01-Jan-1995|null|http://us.imdb.co...|  0|  0|  0|  1|  1|   1|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|
|  2|    GoldenEye (1995)|01-Jan-1995|null|http://us.imdb.co...|  0|  1|  1|  0|  0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   1|   0|   0|
|  3|   Four Rooms (1995)|01-Jan-1995|null|http://us.imdb.co...|  0|  0|  0|  0|  0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   1|   0|   0|
|  4|   Get Shorty (1995)|01-Jan-1995|null|http://us.imdb.co...|  0|  

In [178]:
# change column names
movies = movies.toDF(*['movie_id', 'movie_title',
                       'release_date',
                       'video_release_date','IMDb_URL','unknown','Action','Adventure',
                       'Animation','Children','Comedy','Crime','Documentary','Drama','Fantasy',
                       'Film_Noir','Horror','Musical','Mystery','Romance','Sci_Fi','Thriller','War','Western'])


In [180]:
movies.toPandas()

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children,...,Fantasy,Film_Noir,Horror,Musical,Mystery,Romance,Sci_Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [181]:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
movies.createOrReplaceTempView('movies_table')

q = """
select ratings_table.*,
    movies_table.movie_title
from ratings_table
left join movies_table on movies_table.movie_id = ratings_table.movie_id"""
sqlContext.sql(q).show()

+-------+--------+------+--------------+--------------------+
|user_id|movie_id|rating|unix_timestamp|         movie_title|
+-------+--------+------+--------------+--------------------+
|    196|     242|     3|     881250949|        Kolya (1996)|
|    186|     302|     3|     891717742|L.A. Confidential...|
|     22|     377|     1|     878887116| Heavyweights (1994)|
|    244|      51|     2|     880606923|Legends of the Fa...|
|    166|     346|     1|     886397596| Jackie Brown (1997)|
|    298|     474|     4|     884182806|Dr. Strangelove o...|
|    115|     265|     2|     881171488|Hunt for Red Octo...|
|    253|     465|     5|     891628467|Jungle Book, The ...|
|    305|     451|     3|     886324817|       Grease (1978)|
|      6|      86|     3|     883603013|Remains of the Da...|
|     62|     257|     2|     879372434| Men in Black (1997)|
|    286|    1014|     5|     879781125|Romy and Michele'...|
|    200|     222|     5|     876042340|Star Trek: First ...|
|    210

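Let us try to do what we were doing earlier with the RDDs: finding the most rated movies. The query sorts every movie by its number of ratings, and `show()` prints only the first 20 rows by default: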


In [183]:
q = """
select movie_id,
    movie_title,
    count(user_id) as num_ratings
from (select r.*, m.movie_title
    from ratings_table r
    left join movies_table m on m.movie_id = r.movie_id) A
group by movie_id, movie_title
order by num_ratings desc
"""

sqlContext.sql(q).show()

+--------+--------------------+-----------+
|movie_id|         movie_title|num_ratings|
+--------+--------------------+-----------+
|      50|    Star Wars (1977)|        583|
|     258|      Contact (1997)|        509|
|     100|        Fargo (1996)|        508|
|     181|Return of the Jed...|        507|
|     294|    Liar Liar (1997)|        485|
|     286|English Patient, ...|        481|
|     288|       Scream (1996)|        478|
|       1|    Toy Story (1995)|        452|
|     300|Air Force One (1997)|        431|
|     121|Independence Day ...|        429|
|     174|Raiders of the Lo...|        420|
|     127|Godfather, The (1...|        413|
|      56| Pulp Fiction (1994)|        394|
|       7|Twelve Monkeys (1...|        392|
|      98|Silence of the La...|        390|
|     237|Jerry Maguire (1996)|        384|
|     117|    Rock, The (1996)|        378|
|     172|Empire Strikes Ba...|        367|
|     222|Star Trek: First ...|        365|
|     204|Back to the Futur...| 

And finding the highest rated movies having more than 100 votes (again, `show()` prints only the first 20 rows by default):


In [188]:
q = """
select movie_id,
    movie_title,
    avg(rating) as avg_rating,
    count(movie_id) as num_ratings
from (select r.*, m.movie_title
    from ratings_table r
    left join movies_table m on m.movie_id = r.movie_id) A
group by movie_id, movie_title
having num_ratings > 100
order by avg_rating desc
"""

high_rated_ddf = sqlContext.sql(q)
high_rated_ddf.show()

+--------+--------------------+------------------+-----------+
|movie_id|         movie_title|        avg_rating|num_ratings|
+--------+--------------------+------------------+-----------+
|     408|Close Shave, A (1...| 4.491071428571429|        112|
|     318|Schindler's List ...| 4.466442953020135|        298|
|     169|Wrong Trousers, T...| 4.466101694915254|        118|
|     483|   Casablanca (1942)|  4.45679012345679|        243|
|      64|Shawshank Redempt...| 4.445229681978798|        283|
|     603|  Rear Window (1954)|4.3875598086124405|        209|
|      12|Usual Suspects, T...| 4.385767790262173|        267|
|      50|    Star Wars (1977)|4.3584905660377355|        583|
|     178| 12 Angry Men (1957)|             4.344|        125|
|     134| Citizen Kane (1941)| 4.292929292929293|        198|
|     427|To Kill a Mocking...| 4.292237442922374|        219|
|     357|One Flew Over the...| 4.291666666666667|        264|
|      98|Silence of the La...|  4.28974358974359|     

# Display

How do I get this to work in my notebook?

# Converting from Spark Dataframe to RDD and vice versa

Sometimes you may want to convert a Spark DataFrame to an RDD, or vice versa, so that you can have the best of both worlds.

## To convert from DF to RDD, you can simply do:

In [191]:
high_rated_ddf.rdd.take(2)

[Row(movie_id=408, movie_title='Close Shave, A (1995)', avg_rating=4.491071428571429, num_ratings=112),
 Row(movie_id=318, movie_title="Schindler's List (1993)", avg_rating=4.466442953020135, num_ratings=298)]

In [193]:
high_rated_ddf.head(2)

[Row(movie_id=408, movie_title='Close Shave, A (1995)', avg_rating=4.491071428571429, num_ratings=112),
 Row(movie_id=318, movie_title="Schindler's List (1993)", avg_rating=4.466442953020135, num_ratings=298)]

## To go from an RDD to a dataframe

In [194]:
from pyspark.sql import Row

# create an RDD first
data = [('A',1),('B',2),('C',3),('D',4)]
rdd = sc.parallelize(data)

In [195]:
rdd

ParallelCollectionRDD[498] at parallelize at PythonRDD.scala:195

In [196]:
# map the schema using Row
rdd_new = rdd.map(lambda x: Row(key=x[0], value=int(x[1])))

In [200]:
# convert the rdd to a DataFrame
rdd_as_df = sqlContext.createDataFrame(rdd_new)
rdd_as_df.show()

+---+-----+
|key|value|
+---+-----+
|  A|    1|
|  B|    2|
|  C|    3|
|  D|    4|
+---+-----+

