## Spark in Action with Example
### Opening & inspecting the files

Let us work through a concrete example that covers some common transformations.

We will use the MovieLens 100K dataset (ml-100k.zip), a stable benchmark dataset: 100,000 ratings from roughly 1,000 users on roughly 1,700 movies, released in April 1998.

The MovieLens dataset contains many files, but we will work with only three:

1. **Users**: stored in the file “u.user”. Its columns are:

`['user_id', 'age', 'sex', 'occupation', 'zip_code']`

2. **Ratings**: stored in the file “u.data”. Its columns are:

`['user_id', 'movie_id', 'rating', 'unix_timestamp']`

3. **Movies**: stored in the file “u.item”. Its columns are:

`['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']`, followed by 19 genre flag columns (0 or 1).
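To make the file layouts concrete, here is a minimal plain-Python sketch (no Spark needed) that pairs the fields of one raw line with the column names listed above. The helper `parse_line` is a hypothetical name introduced for illustration; the two sample lines are the first records of “u.user” and “u.data” as they appear in this notebook's output, where “u.user” is pipe-delimited and “u.data” is tab-delimited.

```python
# Column names as listed above.
USER_COLS = ["user_id", "age", "sex", "occupation", "zip_code"]
RATING_COLS = ["user_id", "movie_id", "rating", "unix_timestamp"]


def parse_line(line, columns, sep):
    """Hypothetical helper: split one raw line and label each field."""
    return dict(zip(columns, line.split(sep)))


# First record of each file, copied from the notebook output.
user = parse_line("1|24|M|technician|85711", USER_COLS, "|")
rating = parse_line("196\t242\t3\t881250949", RATING_COLS, "\t")

print(user["occupation"])  # technician
print(rating["movie_id"])  # 242
```

Note that every field is still a string at this point; nothing is converted to a number.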

Our business partner now asks us to find:
* **The 25 most-rated movie titles in this data.**
* **How many times each movie has been rated.**

In [3]:
!pwd;ls -l

/Users/facradri/Dropbox/Tech/apps/Python/PySpark/pySparkTutorial
total 8
-rw-r--r--@ 1 facradri  LL\Domain Users  1873 Dec 31 16:04 02_MoviesRDDs.ipynb
drwxr-xr-x@ 6 facradri  LL\Domain Users   192 Dec 30 15:45 Data-ML-100k--master


In [5]:
# Distribute the data: create an RDD with one element per line of the file
users = sc.textFile("Data-ML-100k--master/ml-100k/u.user")
users.take(3)

['1|24|M|technician|85711', '2|53|F|other|94043', '3|23|M|writer|32067']

In [8]:
userRDD = sc.textFile("Data-ML-100k--master/ml-100k/u.user") 
ratingRDD = sc.textFile("Data-ML-100k--master/ml-100k/u.data") 
movieRDD = sc.textFile("Data-ML-100k--master/ml-100k/u.item") 
print("userRDD:",userRDD.take(1))
print("ratingRDD:",ratingRDD.take(1))
print("movieRDD:",movieRDD.take(1))

userRDD: ['1|24|M|technician|85711']
ratingRDD: ['196\t242\t3\t881250949']
movieRDD: ['1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0']


In [9]:
# Create a pair RDD from ratingRDD that keeps only the two columns of interest: (movie_id, rating).
RDD_movid_rating = ratingRDD.map(lambda x : (x.split("\t")[1],x.split("\t")[2]))
print("RDD_movid_rating:",RDD_movid_rating.take(4))

RDD_movid_rating: [('242', '3'), ('302', '3'), ('377', '1'), ('51', '2')]
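The lambda in this cell calls `split` twice on every line. As a hedged alternative, the sketch below uses a named function (`to_movie_rating`, a hypothetical name) that splits each line once; it is shown on a plain Python list so it runs without Spark. Passing the same function to `ratingRDD.map(...)` should produce the same pairs as the cell above.

```python
def to_movie_rating(line):
    """Return (movie_id, rating) from one tab-separated u.data line."""
    fields = line.split("\t")  # split once and reuse the result
    return fields[1], fields[2]


# First u.data record, copied from the notebook output.
sample_lines = ["196\t242\t3\t881250949"]
pairs = [to_movie_rating(line) for line in sample_lines]

print(pairs)  # [('242', '3')]
```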


In [10]:
# Create a pair RDD from movieRDD that keeps only the two columns of interest: (movie_id, title).
RDD_movid_title = movieRDD.map(lambda x : (x.split("|")[0],x.split("|")[1]))
print("RDD_movid_title:",RDD_movid_title.take(2))

RDD_movid_title: [('1', 'Toy Story (1995)'), ('2', 'GoldenEye (1995)')]


In [11]:
# Merge the two pair RDDs on movie_id with leftOuterJoin(), which keeps every rating even when no title matches. See the transformation documentation.
rdd_movid_title_rating = RDD_movid_rating.leftOuterJoin(RDD_movid_title)
print("rdd_movid_title_rating:",rdd_movid_title_rating.take(1))

rdd_movid_title_rating: [('346', ('1', 'Jackie Brown (1997)'))]
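To show what the join produces, here is a plain-Python sketch that mimics the semantics of `leftOuterJoin` on small lists of `(key, value)` tuples. The function `left_outer_join` and the movie id `"9999"` are made up for illustration. Unlike this sketch, Spark does not guarantee the order of the joined records.

```python
def left_outer_join(left, right):
    """Mimic pair-RDD leftOuterJoin on lists of (key, value) tuples."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)

    joined = []
    for key, value in left:
        # Every left record survives; a key with no match gets None.
        for match in lookup.get(key, [None]):
            joined.append((key, (value, match)))
    return joined


ratings = [("1", "5"), ("2", "3"), ("9999", "4")]  # "9999" is a made-up id
titles = [("1", "Toy Story (1995)"), ("2", "GoldenEye (1995)")]

for record in left_outer_join(ratings, titles):
    print(record)
# ('1', ('5', 'Toy Story (1995)'))
# ('2', ('3', 'GoldenEye (1995)'))
# ('9999', ('4', None))
```

Each result has the shape `(movie_id, (rating, title))`, which is why the next step reads the title with `x[1][1]`.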


In [12]:
# Map each joined record to a (title, 1) pair so that ratings can be counted per title
rdd_title_rating = rdd_movid_title_rating.map(lambda x: (x[1][1],1 ))
print("rdd_title_rating:",rdd_title_rating.take(2))

rdd_title_rating: [('Jackie Brown (1997)', 1), ('Jackie Brown (1997)', 1)]


In [13]:
# Use the reduceByKey transformation to sum the 1s for each movie title
rdd_title_ratingcnt = rdd_title_rating.reduceByKey(lambda x,y: x+y)
print("rdd_title_ratingcnt:",rdd_title_ratingcnt.take(2))

rdd_title_ratingcnt: [('Jackie Brown (1997)', 126), ('Jungle Book, The (1994)', 85)]
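Conceptually, `reduceByKey` folds all values that share a key into one value with the supplied function. The plain-Python sketch below illustrates that idea on a small list; `reduce_by_key` is a hypothetical helper, and the input pairs are made up rather than taken from the dataset. Spark performs the same combination in a distributed way, so this only shows the result, not how Spark computes it.

```python
def reduce_by_key(pairs, func):
    """Combine all values that share a key using func(accumulated, value)."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


# Made-up (title, 1) pairs for illustration.
title_ones = [
    ("Jackie Brown (1997)", 1),
    ("Toy Story (1995)", 1),
    ("Jackie Brown (1997)", 1),
]
counts = reduce_by_key(title_ones, lambda x, y: x + y)

print(counts)  # {'Jackie Brown (1997)': 2, 'Toy Story (1995)': 1}
```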


In [16]:
# Get the final answer with the takeOrdered action.
# The key negates the count, so the most-rated titles come first.
print("#####################################")
# Uncomment the next line to print the result:
# print("25 most rated movies:", rdd_title_ratingcnt.takeOrdered(25, lambda x: -x[1]))
print("#####################################")

#####################################
#####################################
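`takeOrdered(n, key)` returns the `n` smallest elements according to `key`, so negating the count yields the largest counts first. The sketch below reproduces that behavior in plain Python with `heapq.nsmallest`; the function `take_ordered` is a hypothetical stand-in, and the titles and counts are made up for illustration, not results from the dataset.

```python
import heapq


def take_ordered(items, n, key=None):
    """Return the n smallest items by key, like RDD.takeOrdered on a list."""
    return heapq.nsmallest(n, items, key=key)


# Made-up (title, rating_count) pairs for illustration.
title_counts = [("Movie A", 3), ("Movie B", 10), ("Movie C", 7)]

# Negating the count turns "smallest first" into "most rated first".
top2 = take_ordered(title_counts, 2, key=lambda x: -x[1])

print(top2)  # [('Movie B', 10), ('Movie C', 7)]
```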
