# CS4337 - MOVIE RECOMMENDATION PROJECT

18266401 AYOUB JDAIR


# Section 1: Gathering user data ...


All my files have been uploaded to Google Drive. This cell mounts the drive so that the code below can read them.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
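Once the drive is mounted, it can help to confirm that the folder really contains the data files the later cells read. The sketch below is a minimal check, assuming the file names used further down (`ratings.dat`, `movies.dat`); `missing_files` is a hypothetical helper and not part of the original notebook.

```python
import os

def missing_files(base_dir, names):
    """Return the names from `names` that are not regular files inside base_dir."""
    return [name for name in names
            if not os.path.isfile(os.path.join(base_dir, name))]

# Example with the folder used in this notebook (assumes the drive is mounted):
# missing_files('/content/drive/MyDrive/College/CS4337/movielens/medium',
#               ['ratings.dat', 'movies.dat'])
# An empty list means every file was found.
```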


Imports

In [None]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=7fba639cd84ca641fb6ebd21709c350ebde5c728c05124fe9f35a71cfa8785b8
  Stored in directory: /root/.cache/pip/wheels/a5/0a/c1/9561f6fecb759579a7d863dcd846daaa95f598744e71b02c77
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


In [None]:
import itertools
import os
import sys
from math import sqrt
from operator import add
from os.path import join, isfile, dirname
from time import time

from pyspark.sql import SparkSession
from pyspark.mllib.recommendation import ALS

Collecting user input

In [None]:
topMovies = """1,Toy Story (1995)
780,Independence Day (a.k.a. ID4) (1996)
590,Dances with Wolves (1990)
1210,Star Wars: Episode VI - Return of the Jedi (1983)
648,Mission: Impossible (1996)
344,Ace Ventura: Pet Detective (1994)
165,Die Hard: With a Vengeance (1995)
153,Batman Forever (1995)
597,Pretty Woman (1990)
1580,Men in Black (1997)
231,Dumb & Dumber (1994)"""


In [None]:
parentDir = os.path.abspath('/content/drive/MyDrive/College/CS4337/movielens/medium/ayoub')
ratingsFile = join(parentDir, "personalRatings.txt")

In [None]:
if isfile(ratingsFile):
    r = input("Looks like you've already rated the movies. Overwrite ratings (y/N)? ")
    if r and r[0].lower() == "y":
        os.remove(ratingsFile)
    else:
        sys.exit()

Looks like you've already rated the movies. Overwrite ratings (y/N)? y


In [None]:
prompt = "Please rate the following movie (1-5 (best), or 0 if not seen): "
print(prompt)


Please rate the following movie (1-5 (best), or 0 if not seen): 


In [None]:
now = int(time())
n = 0

Generating Personal Ratings file...

In [None]:
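# Write one line per rated movie in MovieLens format: userId::movieId::rating::timestamp.
# Personal ratings are stored under user ID 0; movies rated 0 (not seen) are skipped.
with open(ratingsFile, 'w') as f:
    for line in topMovies.split("\n"):
        # Split on the first comma only, since a title may itself contain commas
        ls = line.strip().split(",", 1)
        valid = False
        while not valid:
            rStr = input(ls[1] + ": ")
            r = int(rStr) if rStr.isdigit() else -1
            if r < 0 or r > 5:
                print(prompt)
            else:
                valid = True
                if r > 0:
                    f.write("0::%s::%d::%d\n" % (ls[0], r, now))
                    n += 1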

Toy Story (1995): 5
Independence Day (a.k.a. ID4) (1996): 0
Dances with Wolves (1990): 0
Star Wars: Episode VI - Return of the Jedi (1983): 4
Mission: Impossible (1996): 4
Ace Ventura: Pet Detective (1994): 5
Die Hard: With a Vengeance (1995): 4
Batman Forever (1995): 4
Pretty Woman (1990): 1
Men in Black (1997): 3
Dumb & Dumber (1994): 4
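The file format written by the cell above can be sketched without the interactive prompt. This is a minimal illustration, assuming the same `userId::movieId::rating::timestamp` layout and user ID 0 for personal ratings; `format_rating_line` is a hypothetical helper introduced only for this example.

```python
def format_rating_line(movie_line, rating, timestamp, user_id=0):
    """Turn an 'id,title' line plus a rating into 'userId::movieId::rating::timestamp'."""
    # Split on the first comma only, so commas inside a title are left alone
    movie_id, _title = movie_line.strip().split(",", 1)
    return "%d::%s::%d::%d" % (user_id, movie_id, rating, timestamp)

# The timestamp is an arbitrary illustrative value, not one from the real file.
print(format_rating_line("1,Toy Story (1995)", 5, 1600000000))
# prints: 0::1::5::1600000000
```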


In [None]:
if n == 0:
    print("No rating provided!")


 #2: Doing maths stuff ...

Data loading/parsing functions

In [None]:
def parseRating(line):
    """
    Parses a rating record in MovieLens format userId::movieId::rating::timestamp .
    """
    fields = line.strip().split("::")
    return int(fields[3]) % 10, (int(fields[0]), int(fields[1]), float(fields[2]))


def parseMovie(line):
    """
    Parses a movie record in MovieLens format movieId::movieTitle .
    """
    fields = line.strip().split("::")
    print(fields)
    return int(fields[0]), fields[1]

def loadRatings(ratingsFile):
    """
    Load ratings from file and return them as a list of (userId, movieId, rating) tuples.
    """
    if not isfile(ratingsFile):
        print("File %s does not exist." % ratingsFile)
        sys.exit(1)
    print("Opening ratings file...")
    print("PERSONAL RATINGS FILE located at:\n", ratingsFile)
    with open(ratingsFile, 'r') as f:
        # Build a list rather than a lazy filter object: a filter object is always truthy,
        # prints as <filter object ...>, and can only be iterated once.
        ratings = [r for r in (parseRating(line)[1] for line in f) if r[2] > 0]
    if not ratings:
        print("No ratings provided.")
        sys.exit(1)
    else:
        print("Ratings: ", ratings)
        return ratings
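To make the return value of `parseRating` concrete, here is a small standalone sketch that applies the same logic to one record. The record is an illustrative value, not a line from the real data set. The first element of the result (the last digit of the timestamp) is the key that the training/testing split relies on later.

```python
def parse_rating(line):
    """Same logic as parseRating: returns (timestamp % 10, (userId, movieId, rating))."""
    fields = line.strip().split("::")
    return int(fields[3]) % 10, (int(fields[0]), int(fields[1]), float(fields[2]))

# Illustrative record in userId::movieId::rating::timestamp format
key, rating = parse_rating("0::1::5::1600000003")
print(key)     # 3
print(rating)  # (0, 1, 5.0)
```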

Calculating the Root Mean Squared Error Value

In [None]:
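def computeRmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error) of the model's predictions on data,
    an RDD of (userId, movieId, rating) tuples; n is the number of ratings in data.
    """
    print("Computing RMSE...")
    # Predict a rating for every (userId, movieId) pair in data
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    # Pair each predicted rating with the actual rating via the (userId, movieId) key
    predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
      .join(data.map(lambda x: ((x[0], x[1]), x[2]))) \
      .values()
    output = sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
    print("Result! ", output)
    return output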
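The same calculation can be followed on a handful of numbers in plain Python. This is a sketch on made-up (predicted, actual) pairs, not the Spark implementation above. One difference to keep in mind: the sketch divides by the number of pairs it receives, whereas `computeRmse` divides by the `n` passed in.

```python
from math import sqrt

def rmse(pairs):
    """RMSE over (predicted, actual) pairs: sqrt(mean((predicted - actual) ** 2))."""
    pairs = list(pairs)
    squared_errors = [(predicted - actual) ** 2 for predicted, actual in pairs]
    return sqrt(sum(squared_errors) / len(pairs))

# Made-up values for illustration only: squared errors are 1, 0 and 4
example = [(4.0, 5.0), (3.0, 3.0), (2.0, 4.0)]
print(round(rmse(example), 4))  # 1.291
```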

# Section 2: Gathering some more data ...

In [None]:
if __name__ == "__main__":

    print("--------IGNORE THIS (used to help me understand the data and code a little better)--------\n")
    # set up environment
    spark = SparkSession.builder \
        .master("local") \
        .appName("Movie Recommendation Engine") \
        .config("spark.executor.memory", "1gb") \
        .getOrCreate()
   
    sc = spark.sparkContext

    # load personal ratings
    print("Loading personal ratings file... \n")
    myRatings = loadRatings(os.path.abspath('/content/drive/MyDrive/College/CS4337/movielens/medium/ayoub/personalRatings.txt'))
    print("MyRatings: ", myRatings)
    print("Converting myRatings file to RDD... \n")
    myRatingsRDD = sc.parallelize(myRatings, 1)
    print("Converted RDD: ", myRatingsRDD)
    print("\n-------------------------------------------------------------------------------------------")

    # load ratings and movie titles
    movieLensHomeDir = os.path.abspath('/content/drive/MyDrive/College/CS4337/movielens/medium')

    # ratings is an RDD of (last digit of timestamp, (userId, movieId, rating))
    ratings = sc.textFile(join(movieLensHomeDir, "ratings.dat")).map(parseRating)

    # movies is an RDD of (movieId, movieTitle)
    movies = dict(sc.textFile(join(movieLensHomeDir, "movies.dat")).map(parseMovie).collect())


    # your code here
    # please scroll down ...


    # # clean up
    # sc.stop()



--------IGNORE THIS (used to help me understand the data and code a little better)--------

Loading personal ratings file... 

Opening ratings file...
PERSONAL RATINGS FILE located at:
 /content/drive/MyDrive/College/CS4337/movielens/medium/ayoub/personalRatings.txt
Ratings:  <filter object at 0x7f3586093a10>
MyRatings:  <filter object at 0x7f3586093a10>
Converting myRatings file to RDD... 

Converted RDD:  ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

-------------------------------------------------------------------------------------------


# Section 3: My code here ...

#-1 Create Training data set

#-2 Create Testing data set

In [None]:
# Here I am splitting the dataset into a training set and a testing set, 60% and 40% respectively
# I will then evaluate the model on the testing set and
# look at the RMSE value to evaluate performance.

# The key x[0] is the last digit of the timestamp (0-9): digits 0-5 go to training (about 60%)
training_delta = ratings.filter(lambda x: x[0] < 6).values()
# converting the PipelinedRDD back to a plain RDD
training = spark.sparkContext.parallelize(training_delta.collect())
trainingCount = training.count()


# Digits 6-9 go to testing (about 40%)
testing_delta = ratings.filter(lambda x: x[0] >= 6).values()
# converting the PipelinedRDD back to a plain RDD
testing = spark.sparkContext.parallelize(testing_delta.collect())
testingCount = testing.count()

print("--------IGNORE THIS TOO--------\n")

def percentage(percent, n):
  return int((percent * n) / 100.0)

entries = 1000209
Ctraining = percentage(60, entries)
Ctesting = percentage(40, entries)
print("Training: ", trainingCount)
print("Testing:", testingCount)
print(Ctraining, Ctesting)

print("\n---------------------------------")


--------IGNORE THIS TOO--------

Training:  602241
Testing: 397968
600125 400083

---------------------------------
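The split rule can be tried on a tiny in-memory list to see why it gives roughly 60/40. This is a plain-Python sketch with made-up records; `split_by_key` is a hypothetical helper. The last two lines use the counts printed above to show the actual shares.

```python
def split_by_key(keyed_ratings, threshold=6):
    """Split (key, rating) pairs: keys below threshold go to training, the rest to testing."""
    training = [rating for key, rating in keyed_ratings if key < threshold]
    testing = [rating for key, rating in keyed_ratings if key >= threshold]
    return training, testing

# Ten made-up records, one per possible last digit of a timestamp (0-9)
keyed = [(digit, (1, 100 + digit, 3.0)) for digit in range(10)]
training, testing = split_by_key(keyed)
print(len(training), len(testing))  # 6 4

# Share of the real data set in each part, from the counts printed above
print(round(100 * 602241 / 1000209, 2))  # 60.21
print(round(100 * 397968 / 1000209, 2))  # 39.79
```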


#-3 Train the model
-Using the ALS algorithm

-Fit training data to model




In [None]:
%%html
<iframe style="pointer-events:none;" scrolling="no" src="https://drive.google.com/file/d/1bTh1JLtC2u24PwfBaew8kw6XJUkKTa5K/preview" width="1000" height="450" allow="autoplay"></iframe>


<!-- Image below from : "Prototyping a Recommender System Step by Step Part 2: Alternating  -->
<!--                    Least Square (ALS) Matrix Factorization in Collaborative Filtering" -->
<!--                    - Kevin Liao -->
<!-- Link: https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-2-alternating-least-square-als-matrix-4a76c58714a1 -->
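In the factorisation pictured above, each user and each movie is described by a short vector of latent factors, and a predicted rating is the dot product of the two. The sketch below assumes this standard matrix-factorisation setup and uses made-up vectors; `predict_rating` is a hypothetical helper, not a Spark call.

```python
def predict_rating(user_factors, item_factors):
    """Predicted rating = dot product of a user's and a movie's latent factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

# Made-up factor vectors of length 3; the length of these vectors is the "rank"
user = [1.0, 0.5, 2.0]
movie = [2.0, 1.0, 0.5]
print(predict_rating(user, movie))  # 3.5
```

ALS alternates between fixing the movie vectors to solve for the user vectors and fixing the user vectors to solve for the movie vectors, which is where the name "alternating least squares" comes from.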

In [None]:
# Here I am using the ALS class imported from MLlib in the provided code.
# Since we are using MLlib (pyspark.mllib) and not ML (pyspark.ml), I am calling the ALS.train() method.
# This method takes the data set in the form of an RDD, which is why I converted above.

# The parameters used to train the model are as follows:
# 1 - "training"    : The training set created above
# 2 - "rank"        : The number of latent factors used in the matrix factorisation part of ALS;
#                     it is effectively the width of the "User Matrix" and "Item Matrix" in the above image
# 3 - "iterations"  : The number of ALS iterations, i.e. how many times we perform the action in the image above
# 4 - "lambda_"     : The regularisation parameter, which helps prevent overfitting in the model
# 5 - "nonnegative" : This is True because we do not want the model to return any negative predictions;
#                     it constrains the learned factors to be non-negative

# I will then call the provided "computeRmse()" and pass in:
# 1 - My model
# 2 - My testing data set
# 3 - The count() of my testing data set, declared above

model = ALS.train(training, rank=8, iterations=20, lambda_=0.07, nonnegative=True)
testingRMSE = computeRmse(model, testing, testingCount)
format_RMSE = "{:.2f}".format(testingRMSE)
print("Calculated RMSE on Model: ", format_RMSE)

# I spent a considerable amount of time tweaking "rank" , "iterations", and "lambda_"
# and found that 8, 20, and 0.07 have provided the best model

Computing RMSE...
Result!  0.8686590127072287
Calculated RMSE on Model:  0.87
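The manual tweaking of "rank", "iterations" and "lambda_" could also be automated with a small grid search. The sketch below is one possible way to do it, not the procedure actually used here. `grid_search`, `fake_train` and `fake_rmse` are hypothetical names, and the stand-in functions exist only so the example runs without Spark.

```python
import itertools

def grid_search(train_fn, rmse_fn, ranks, lambdas, num_iters):
    """Try every (rank, lambda, iterations) combination and return the best one.

    train_fn(rank, lmbda, iterations) must return a model;
    rmse_fn(model) must return that model's RMSE (lower is better).
    Returns a tuple (best_rmse, rank, lmbda, iterations).
    """
    best = None
    for rank, lmbda, iterations in itertools.product(ranks, lambdas, num_iters):
        score = rmse_fn(train_fn(rank, lmbda, iterations))
        if best is None or score < best[0]:
            best = (score, rank, lmbda, iterations)
    return best

# Stand-in functions so the sketch runs without Spark: the "model" is just its
# parameters, and the fake error is smallest at rank 8, lambda 0.07, 20 iterations.
fake_train = lambda rank, lmbda, iterations: (rank, lmbda, iterations)
fake_rmse = lambda model: abs(model[0] - 8) + abs(model[1] - 0.07) + abs(model[2] - 20)
print(grid_search(fake_train, fake_rmse, [4, 8, 12], [0.01, 0.07, 0.1], [10, 20]))
# (0.0, 8, 0.07, 20)
```

In this notebook, `train_fn` could wrap `ALS.train(training, rank=..., iterations=..., lambda_=..., nonnegative=True)` and `rmse_fn` could wrap `computeRmse(model, testing, testingCount)`. Selecting parameters on the same set that is used for the final RMSE tends to make that RMSE look better than it is, so a separate validation split is a common choice.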


-Predict using testing data


In [None]:
# Here I will be making my predictions using the predictAll() method.
# I will first make a set of the movies I have rated.
# I will use this set to make sure my list of possible recommendations does not include
# movies I have already rated.
# Finally, I will let you choose how many recommendations you would like.
# Note: predictAll() is called here for user ID 1, which is a user from ratings.dat;
# the personal ratings were saved under user ID 0 and are not part of the training set above.
# The rated movie IDs are read from myRatingsRDD, which can be traversed more than once.
myRatedMoviesSet = set(myRatingsRDD.map(lambda x: x[1]).collect())
possibleRecommendations = sc.parallelize([m for m in movies if m not in myRatedMoviesSet])
predictions = model.predictAll(possibleRecommendations.map(lambda x: (1, x))).collect()
n = int(input("How many recommendations would you like?: "))
recommendations = sorted(predictions, key=lambda x: x[2], reverse=True)[:n]

How many recommendations would you like?: 10
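The selection step (drop movies already rated, then keep the n highest predicted ratings) can be sketched in plain Python. The predictions here are made-up tuples in `(userId, movieId, predictedRating)` form, and `top_n` is a hypothetical helper. Unlike the cell above, it filters after predicting rather than before, which gives the same result on a small list.

```python
import heapq

def top_n(predictions, rated_movie_ids, n):
    """Return the n highest-rated (user, movie, rating) predictions, skipping rated movies."""
    unseen = [p for p in predictions if p[1] not in rated_movie_ids]
    return heapq.nlargest(n, unseen, key=lambda p: p[2])

# Made-up predictions; movie 13 has already been rated, so it is excluded
preds = [(0, 10, 3.2), (0, 11, 4.8), (0, 12, 4.1), (0, 13, 4.9)]
print(top_n(preds, rated_movie_ids={13}, n=2))
# [(0, 11, 4.8), (0, 12, 4.1)]
```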


#-4 Return recommendation


In [None]:
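print("Here are the top %d Movies recommended for you:" % n)
# recommendations is sorted from highest to lowest predicted rating,
# so the labels count down from n while the best match is printed first
for i, rec in enumerate(recommendations):
    j = n - i
    print("Number ", j, ": ", movies[rec[1]])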

Here are the top 10 Movies recommended for you:
Number  10 :  Gambler, The (A Játékos) (1997)
Number  9 :  Tic Code, The (1998)
Number  8 :  American Dream (1990)
Number  7 :  Visitors, The (Les Visiteurs) (1993)
Number  6 :  Anne Frank Remembered (1995)
Number  5 :  Train of Life (Train De Vie) (1998)
Number  4 :  Bewegte Mann, Der (1994)
Number  3 :  Before the Rain (Pred dozhdot) (1994)
Number  2 :  Smashing Time (1967)
Number  1 :  Aimée & Jaguar (1999)
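In the loop above the list is sorted best first while the labels count down from n, so the movie with the highest predicted rating is the one printed as "Number 10". If the intent is for "Number 1" to be the top recommendation, one option is to number the list from 1 in the order it is sorted. This is a sketch with placeholder titles; `format_ranking` is a hypothetical helper.

```python
def format_ranking(titles):
    """Label titles 1..n, assuming titles is already sorted best first."""
    return ["Number %d: %s" % (rank, title)
            for rank, title in enumerate(titles, start=1)]

# Placeholder titles, already ordered from highest to lowest predicted rating
for line in format_ranking(["Movie A", "Movie B", "Movie C"]):
    print(line)
# Number 1: Movie A
# Number 2: Movie B
# Number 3: Movie C
```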


#-5 Report the accuracy of the model


*I spent a considerable amount of time tweaking the "rank", "iterations", and "lambda_" parameters in ALS.train() and found that 8, 20, and 0.07 provided the best model. With these values the RMSE on the testing set is 0.87, meaning the model's predictions are typically off by a little under one star on the 1-5 rating scale, which I consider an acceptable level of accuracy. Given more time, this could possibly be improved.*

# Section 4: Finishing touches ...

Closing up shop...

In [None]:
    sc.stop()
    print("----------------------------------------NOTES FOR XXXX-----------------------------------------\n")
    print("\n")
    print("--------------------------------------THANK YOU FOR READING-------------------------------------")

----------------------------------------NOTES FOR SAHIR-----------------------------------------

Because I have laid out this submission to make it easier for you to follow along
and see my process, by dividing up the cells and putting headers and text between them,
running individual cells can sometimes be troublesome.
I have extensively tested this code before submission and can guarantee it runs on my machine.
Please run from main and 'After', or all together if that doesn't work.
Please run on colab.research.google.com if possible. As you are aware, I had a lot
of trouble setting up PySpark on my Mac, so I ended up using Google Colab in an effort to catch up
because of the significant time I had lost. My files are also hosted on Google Drive.
I suspect that my submission will also run in Jupyter Notebook, as both use .ipynb files, but
this is just an FYI that I resorted to using Colab in the end.

--------------------------------------THANK YOU FOR READING-------------