# Big Data Processing Coursework





In this short notebook, we will load and explore the MovieLens dataset. Specifically, this notebook covers:

- Loading data in memory
- Creating SQLContext
- Creating Spark DataFrame
- Group data by columns
- Operating on columns
- Running SQL Queries from a Spark DataFrame
- Loading in a DataFrame

The goal is to build a recommendation system that uses transactional data linking a user to an item to produce a list of items to recommend to that user. There are two approaches:
## Collaborative Filtering
In the MovieLens dataset, we have movies previously rated by each user, and we try to determine whether users who behaved similarly in the past, i.e. liked or disliked similar movies, will behave similarly in the future.
## Content based Filtering


Importing External files/Libraries

In [None]:
#!/usr/bin/python
from __future__ import print_function 

import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
import sys
import re
import random

#import numpy
from math import sqrt
from movielensfcn import parseMovies, removeDuplicates, itemItem


# Reuse the active SparkContext if one exists; otherwise create one named "MovieLens"
sc = SparkContext.getOrCreate(SparkConf().setAppName("MovieLens"))

#sc.addPyFile("similarity.py")
sc.addPyFile("movielensfcn.py")

#from similarity import cosine_similarity, jaccard_similarity

# Similarity Functions

In [6]:
def cosine_similarity(ratingPairs):

    numPairs = 0
    sum_xx = sum_yy = sum_xy = 0

    for ratingX, ratingY in ratingPairs:

        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1

    numerator = sum_xy
    denominator = sqrt(sum_xx) * sqrt(sum_yy)

    score = 0
    if (denominator):
        score = ((float(numerator)) / (float(denominator)))

    return (score, numPairs)

def jaccard_similarity(ratingPairs):
    # The Jaccard similarity coefficient is a commonly used indicator of the
    # similarity between two sets. For sets A and B it is defined as the ratio
    # of the number of elements of their intersection to the number of elements
    # of their union. If A and B are both empty, we define
    # Jaccard_Similarity(A,B) = 1.

    numPairs = 0
    setX = set()
    setY = set()
    for ratingX, ratingY in ratingPairs:
        # Collect the distinct rating values seen for each movie
        setX.add(ratingX)
        setY.add(ratingY)
        numPairs += 1

    intersect_xy = setX.intersection(setY)
    numerator = len(intersect_xy)
    denominator = len(setX) + len(setY) - len(intersect_xy)

    # Both sets empty: the similarity is defined as 1
    score = 1.0
    if (denominator):
        score = ((float(numerator)) / (float(denominator)))

    return (score, numPairs)
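
To make the two measures concrete, here is a small worked example in plain Python, with no Spark required. The names `cosine_score` and `jaccard_score` and all of the ratings are hypothetical and chosen only for illustration; this is a sketch of the formulas, not the notebook's own functions.

```python
from math import sqrt

def cosine_score(rating_pairs):
    """Cosine similarity of two rating vectors given as (x, y) pairs."""
    sum_xx = sum(x * x for x, _ in rating_pairs)
    sum_yy = sum(y * y for _, y in rating_pairs)
    sum_xy = sum(x * y for x, y in rating_pairs)
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = sum_xy / denominator if denominator else 0.0
    return score, len(rating_pairs)

def jaccard_score(set_a, set_b):
    """Size of the intersection divided by size of the union (1.0 if both are empty)."""
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)

# Hypothetical co-ratings: (rating for movie X, rating for movie Y) per user
identical = [(4.0, 4.0), (2.0, 2.0), (5.0, 5.0)]
opposed = [(5.0, 1.0), (1.0, 5.0)]

identical_score, identical_count = cosine_score(identical)
opposed_score, opposed_count = cosine_score(opposed)
half_overlap = jaccard_score({1, 2, 3}, {2, 3, 4})

print(identical_score, identical_count)
print(opposed_score, opposed_count)
print(half_overlap)
```

Identical rating vectors score 1 (up to floating-point rounding), while the opposed pair still scores 10/26, roughly 0.38. Because ratings are never negative, cosine scores on raw ratings cannot drop below 0, which is one reason a high similarity threshold is a sensible choice with this measure.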

First, let's get the data that we will be working with in this notebook. We are using two files from the MovieLens dataset.

In [4]:
ratings_file = "/data/movie-ratings/ratings.dat"
movies_file = "/data/movie-ratings/movies.dat"

In [None]:
#!wget --quiet 

In [None]:
#!unzip -q -o -d /data/movie-ratings/

In [10]:
#!ls -1 /data/movie-ratings

ls: cannot access /data/movie-ratings: No such file or directory


In [11]:
ratings_data = sc.textFile(ratings_file)
movies_data = sc.textFile(movies_file)

Inspect the files to see what we are dealing with

In [None]:
print(ratings_data.take(5))

## dat files
The columns are separated by a double colon (::).
We are also told that each record has the following format:
user::movie::rating::timestamp

## csv files
The columns are separated by a comma (,).
We are also told that each record has the following format:
user,movie,rating,timestamp

In [16]:
print(movies_data.take(5))

[u'1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy', u'2::Jumanji (1995)::Adventure|Children|Fantasy', u'3::Grumpier Old Men (1995)::Comedy|Romance', u'4::Waiting to Exhale (1995)::Comedy|Drama|Romance', u'5::Father of the Bride Part II (1995)::Comedy']


The columns are separated by a double colon (::).
We are also told, and can validate from the output above, that each record has the following format:
movie::titleandyear::genre
There is no header row.

## csv files
The columns are separated by a comma (,).
We are also told and can validate that each record has the following format:
movie,titleandyear,genre
There is also a header row.
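
As a rough illustration of the two layouts, the sketch below parses one record of each kind in plain Python. The function names are hypothetical, and the CSV row is a made-up sample whose title contains a comma. It uses the standard `csv` module for that case because a plain split on "," would cut such a title in two.

```python
import csv

def parse_movie_dat_line(line):
    """Split a '::'-separated movies record into (movie_id, (title_and_year, genres))."""
    # maxsplit=2 keeps the record to exactly three fields
    movie_id, title_and_year, genres = line.split("::", 2)
    return int(movie_id), (title_and_year, genres.split("|"))

def parse_movie_csv_line(line):
    """Parse a comma-separated movies record, honouring quoted titles."""
    movie_id, title_and_year, genres = next(csv.reader([line]))
    return int(movie_id), (title_and_year, genres.split("|"))

dat_sample = "1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy"
# Hypothetical CSV row: the quoted title contains a comma
csv_sample = '11,"American President, The (1995)",Comedy|Drama|Romance'

dat_parsed = parse_movie_dat_line(dat_sample)
csv_parsed = parse_movie_csv_line(csv_sample)
print(dat_parsed)
print(csv_parsed)
```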

In [None]:
numPartitions =1000

## Header File

In [8]:
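# str.find() returns -1 (which is truthy) when there is no match, so test the extension explicitly
if ratings_file.endswith('.dat'):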
	movies= movies_data.map(lambda line: re.split(r'::',line)).map(lambda x: (int(x[0]),(x[1],x[2])))
	ratings = ratings_data.map(lambda line: re.split(r'::',line)).map(lambda x: (int(x[0]),(int(x[1]),float(x[2])))).partitionBy(100)
else:
	ratings_header = ratings_data.take(1)[0]
	movies_header = movies_data.take(1)[0]
	movies= movies_data.filter(lambda line: line!=movies_header).map(lambda line: re.split(r',',line)).map(lambda x: (int(x[0]),(x[1],x[2])))
	ratings = ratings_data.filter(lambda line: line!=ratings_header).map(lambda line: re.split(r',',line)).map(lambda x: (int(x[0]),(int(x[1]),float(x[2])))).partitionBy(100)


How many movies do we have in the movies file?

In [17]:
numMovies = movies.count()

How many users have rated our movies?

In [None]:
# ratings is keyed by user ID, so count the distinct keys
numUsers = ratings.keys().distinct().count()

In [None]:
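threshold = 0.97
topN = 50
# Assumed placeholders: these names are used below but never defined in this notebook
movie_id = 1          # 1 is Toy Story (1995) in movies.dat
minOccurence = 50
algorithm = "COSINE"  # or "JACCARD"

In [None]:
print('{0}, {1}, {2}, {3}, {4} {5} {6}'.format(ratings_file, movies_file, movie_id, threshold, topN, minOccurence, algorithm))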

Joining RDDs
Join the ratings RDD with itself on the user key, so that every pair of movies rated by the same user appears in one record.

In [None]:
user_ratings_data = ratings.join(ratings)


Remove a rating if a user gives the same value for the same movie

In [None]:
unique_joined_ratings = user_ratings_data.filter(removeDuplicates)


Map RDDs: re-key each joined record by its movie pair, with the two ratings as the value.

In [None]:
movie_pairs = unique_joined_ratings.map(itemItem).partitionBy(numPartitions)


Now group all rating pairs together for the same pair of movies.

In [None]:
movie_pairs_ratings= movie_pairs.groupByKey()
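
The helpers `removeDuplicates` and `itemItem` come from `movielensfcn.py`, which is not shown in this notebook. The sketch below is therefore only an assumption about what they might do, based on how their output is used later: the key is a pair of movie IDs and the value is a pair of ratings. It traces the self-join, filter, map and group steps in plain Python on a few hypothetical ratings; the `_sketch` function names are invented for this example.

```python
from collections import defaultdict
from itertools import product

# Hypothetical ratings keyed by user: (user, (movie, rating))
ratings = [
    (1, (10, 4.0)), (1, (20, 5.0)),
    (2, (10, 3.0)), (2, (20, 2.0)), (2, (30, 5.0)),
]

# Self-join on the user key: every ordered pair of ratings by the same user
by_user = defaultdict(list)
for user, movie_rating in ratings:
    by_user[user].append(movie_rating)
joined = [(user, (a, b)) for user, rs in by_user.items() for a, b in product(rs, rs)]

def remove_duplicates_sketch(record):
    """Keep one ordering of each pair and drop a movie paired with itself."""
    _, ((movie1, _), (movie2, _)) = record
    return movie1 < movie2

def item_item_sketch(record):
    """Re-key a joined record as ((movie1, movie2), (rating1, rating2))."""
    _, ((movie1, rating1), (movie2, rating2)) = record
    return (movie1, movie2), (rating1, rating2)

pairs = [item_item_sketch(r) for r in joined if remove_duplicates_sketch(r)]

# Equivalent of groupByKey: collect every rating pair per movie pair
grouped = defaultdict(list)
for movie_pair, rating_pair in pairs:
    grouped[movie_pair].append(rating_pair)
grouped = dict(grouped)
print(grouped)
```

Note how the self-join grows quadratically with the number of ratings per user (4 + 9 = 13 joined records from 5 ratings here), which is why the real job repartitions the data before grouping.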

In [9]:
# Cosine similarity is used for "COSINE" and for any other value of algorithm
if algorithm == "JACCARD":
	item_item_similarities = movie_pairs_ratings.mapValues(jaccard_similarity).persist()
else:
	item_item_similarities = movie_pairs_ratings.mapValues(cosine_similarity).persist()





In [None]:
item_item_sorted=item_item_similarities.sortByKey()

In [None]:
item_item_sorted.persist()

# Keep only the pairs that include movie_id and that are "good" as defined by
# our quality thresholds above. Each record is ((movie1, movie2), (similarity, occurrences)).
filteredResults = item_item_sorted.filter(lambda pair_sim:
        (pair_sim[0][0] == movie_id or pair_sim[0][1] == movie_id)
        and pair_sim[1][0] > threshold and pair_sim[1][1] > minOccurence)

if (topN==0):
    topN=10

# Swap key and value so that sortByKey orders by (similarity, occurrences)
results = filteredResults.map(lambda pair_sim: (pair_sim[1], pair_sim[0])).sortByKey(ascending = False)
resultsTopN = results.take(topN)
results.coalesce(1).saveAsTextFile("movielens")
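
The filter-and-rank step can be traced on a handful of made-up records without Spark. Everything in the sketch below is hypothetical: the similarity records, the thresholds and the names `target_movie`, `min_similarity`, `min_co_ratings` and `is_good_match` are stand-ins chosen for illustration.

```python
# Hypothetical output of the similarity step:
# ((movie1, movie2), (similarity, number_of_co_ratings))
similarities = [
    ((1, 2), (0.99, 120)),
    ((1, 3), (0.98, 10)),    # similar, but too few co-ratings
    ((2, 3), (0.995, 300)),  # does not involve the target movie
    ((1, 4), (0.95, 500)),   # below the similarity threshold
    ((1, 5), (0.985, 80)),
]
target_movie, min_similarity, min_co_ratings, top_n = 1, 0.97, 50, 2

def is_good_match(record):
    """True when the pair contains the target movie and passes both thresholds."""
    (movie1, movie2), (similarity, co_ratings) = record
    return (target_movie in (movie1, movie2)
            and similarity > min_similarity
            and co_ratings > min_co_ratings)

# Swap to ((similarity, co_ratings), pair) and sort descending,
# mirroring map + sortByKey(ascending=False) on an RDD
ranked = sorted(
    ((sim, pair) for pair, sim in similarities if is_good_match((pair, sim))),
    reverse=True,
)
top_matches = ranked[:top_n]
print(top_matches)
```

Requiring a minimum number of co-ratings alongside the similarity threshold guards against pairs that look similar only because very few users rated both movies.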




The join function combines two datasets (Key,ValueV) and (Key,ValueW) to produce (Key, (ValueV,ValueW)). Let's combine the movies and ratings files to get meaningful recommendations.
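
The join semantics described above can be mimicked in plain Python. The sketch below is not Spark's implementation: `join_pairs` is a hypothetical helper that performs the same inner join on two lists of (key, value) pairs, and the scores are made up.

```python
def join_pairs(left, right):
    """Inner join of two lists of (key, value) pairs, like RDD.join."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Emit (key, (v, w)) for every matching combination; unmatched keys are dropped
    return [(key, (v, w)) for key, v in left for w in right_by_key.get(key, [])]

titles = [(1, "Toy Story (1995)"), (2, "Jumanji (1995)")]
scores = [(1, 0.99), (2, 0.98), (3, 0.97)]  # hypothetical; key 3 has no title

joined_titles = join_pairs(titles, scores)
print(joined_titles)
```

Key 3 appears only in `scores`, so it is absent from the result, just as an inner join on RDDs drops keys that are missing from either side.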

In [None]:

# Build a movie ID -> title lookup on the driver from the movies RDD
nameDict = movies.mapValues(lambda v: v[0]).collectAsMap()

print("Top " + str(topN) + " similar movies for " + nameDict[movie_id])
for result in resultsTopN:
    (sim, pair) = result
    # Display the movie in the pair that isn't the one we're looking at
    similarMovieID = pair[0]
    if (similarMovieID == movie_id):
        similarMovieID = pair[1]
    print(nameDict[similarMovieID] + "\tscore: " + str(sim[0]) + "\tstrength: " + str(sim[1]))
	

In [None]:
sc.stop()