# Machine Learning with Spark

### This is based on a public lesson from Databricks [available here](https://github.com/databricks/spark-training/tree/master/machine-learning)

Have you ever wondered how Netflix knows which movies you will like, based only on your ratings of other movies? This is a machine learning problem called collaborative filtering. Spark ships with an algorithm for solving it called Alternating Least Squares (ALS). We'll build a model from a real movie ratings data set, and then see what it predicts for you on various movies!

Here are some more details:
 - The MovieLens data set comes from the University of Minnesota CS department. MovieLens was one of the first consumer recommender systems in the early 2000s, and it is still online [here](http://movielens.org).
 - We won't be doing any serious cross-validation of the model, although Spark does make that relatively easy.
 - The SparkML guide is [here](http://spark.apache.org/docs/latest/ml-guide.html). The Spark Collaborative Filtering guide (very relevant) is [here](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html).
 - An article on the ALS learning algorithm is [here](http://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/). The original research paper is [here](http://dl.acm.org/citation.cfm?id=1608614).
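
To make the idea behind Alternating Least Squares concrete, here is a minimal NumPy sketch on a tiny dense ratings matrix. It is not Spark's implementation: the function name `als_sketch`, the toy matrix, and the hyperparameters are all assumptions chosen for illustration, and zero entries are treated as missing ratings. The sketch alternates between fixing the movie factors and solving a small regularized least-squares problem for each user, then doing the same for each movie with the user factors fixed.

```python
import numpy as np

def als_sketch(R, rank=2, reg=0.1, iterations=20, seed=0):
    """Factor R (users x movies, 0 = missing) as U @ V.T by alternating ridge regressions."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    observed = R > 0
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_movies, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iterations):
        # Fix the movie factors V and solve for each user's factor vector
        for u in range(n_users):
            idx = observed[u]
            if idx.any():
                Vu = V[idx]
                U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, idx])
        # Fix the user factors U and solve for each movie's factor vector
        for m in range(n_movies):
            idx = observed[:, m]
            if idx.any():
                Um = U[idx]
                V[m] = np.linalg.solve(Um.T @ Um + ridge, Um.T @ R[idx, m])
    return U, V

# Toy ratings (hypothetical, not from MovieLens): two groups of users with opposite tastes
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
U, V = als_sketch(R)
predicted = U @ V.T
mask = R > 0
train_mse = np.mean((predicted[mask] - R[mask]) ** 2)
print("training MSE on the toy matrix:", train_mse)
```

The entries of `predicted` at the positions where `R` is zero are the model's guesses for ratings the users never gave, which is the quantity a recommender cares about.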

In [1]:
import pyspark
import pyspark.sql
import pandas, pandas.plotting
import matplotlib.pyplot as plt
from pyspark.sql.functions import *

from IPython.display import display, HTML

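# Reuse a running SparkContext if there is one; otherwise start a local one on all cores
sc = pyspark.SparkContext.getOrCreate(pyspark.SparkConf().setMaster('local[*]'))
spark = pyspark.sql.SparkSession(sc)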

# Useful function for displaying the first 50 rows of a DataFrame as an HTML table
def show(df):
    header = '</th><th>'.join(str(c) for c in df.columns)
    rows = '</tr><tr>'.join(
        '<td>{}</td>'.format('</td><td>'.join(str(v) for v in row))
        for row in df.take(50))
    display(HTML('<table><tr><th>{}</th></tr><tr>{}</tr></table>'.format(header, rows)))

In [2]:
# Functions for loading and parsing the Movielens data set

# ../data/movie-ratings.dat is in this format:
# userId::movieId::rating (1-5 scale)::timestamp (we ignore this)
# 1::1193::5::978300760
# 1::661::3::978302109
# 1::914::3::978301968
# 1::3408::4::978300275
# 1::2355::5::978824291

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
def parseRating(line):
    # Parse a (user, movie, rating) triple from a line of ../data/movie-ratings.dat
    fields = line.strip().split("::")
    return (int(fields[0]), int(fields[1]), float(fields[2]))

def parseMovie(line):
    # Parse a movie ID and title from a line of ../data/movies.dat
    fields = line.strip().split("::")
    return int(fields[0]), fields[1]

def loadRatings():
    # Load and parse the entire movie-ratings file
    with open('../data/movie-ratings.dat', 'r') as f:
        ratings = [parseRating(l) for l in f]
    # Skip any ratings of 0 (these are bad data), then drop the first 1000 remaining ratings
    ratings = [Rating(l[0], l[1], l[2]) for l in ratings if l[2] > 0][1000:]
    return ratings

# Preload all the movie names into a global dict for easy lookup

movieNamesDict = {}
with open('../data/movies.dat', encoding="ISO-8859-1") as moviesFile:
    for movie in moviesFile:
        movieId, title = parseMovie(movie)
        movieNamesDict[movieId] = title

def getNameOfMovie(movie):
    # Fall back to 'Unknown' for IDs that are not in movies.dat
    return movieNamesDict.get(movie, 'Unknown')
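
The `::`-separated rating format described in the comments above is easy to check in isolation, without Spark or the data files. The following is a small sketch that assumes a hypothetical helper, `parse_rating_safe`, which returns `None` for a malformed line instead of raising. The first two sample lines are the ones quoted in the format comment; the third is deliberately broken.

```python
def parse_rating_safe(line):
    """Return (user_id, movie_id, rating), or None if the line is malformed."""
    fields = line.strip().split("::")
    if len(fields) < 3:
        return None
    try:
        return int(fields[0]), int(fields[1]), float(fields[2])
    except ValueError:
        # A field that should be numeric was not
        return None

sample_lines = [
    "1::1193::5::978300760\n",
    "1::661::3::978302109\n",
    "not a rating line\n",
]
parsed = [parse_rating_safe(line) for line in sample_lines]
valid = [p for p in parsed if p is not None]
print(valid)
```

Whether to skip bad lines silently or fail loudly is a design choice; the notebook's own `parseRating` raises on malformed input.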
    

In [3]:

rank = 20
numIterations = 15
ratingsRdd = sc.parallelize(loadRatings())
model = ALS.train(ratingsRdd, rank, numIterations, lambda_=0.01)

# Compute the Mean Squared Error on the training ratings themselves (there is no held-out set),
# so this measures how well the model reconstructs data it has already seen
testdata = ratingsRdd.map(lambda p: (p[0], p[1]))
# Try to reconstruct all the ratings from the model
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
# Join them with the real ratings
ratesAndPreds = ratingsRdd.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
# Compute the MSE between the real rating and the predicted rating
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

Output (truncated): the cell did not finish because the Python process lost its connection to the Spark JVM.

py4j.protocol.Py4JNetworkError: Answer from Java side is empty
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe
Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:46707)
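
The MSE in cell 3 is computed on the same ratings the model was trained on, so it says little about ratings the model has not seen. A minimal pure-Python sketch of a hold-out evaluation follows. The helpers `holdout_split` and `mse`, the toy ratings, and the constant "predict the mean" model are all hypothetical stand-ins for illustration; with Spark, `RDD.randomSplit` could play the role of the split, and the predictions would come from the trained ALS model.

```python
import random

def holdout_split(ratings, test_fraction=0.2, seed=42):
    """Shuffle a copy of the ratings and split it into (train, test) lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def mse(actual_by_key, predicted_by_key):
    """Mean squared error over the (user, movie) keys present in both dicts."""
    shared = actual_by_key.keys() & predicted_by_key.keys()
    if not shared:
        raise ValueError("no overlapping (user, movie) pairs to compare")
    return sum((actual_by_key[k] - predicted_by_key[k]) ** 2 for k in shared) / len(shared)

# Toy (user, movie, rating) triples; not taken from the MovieLens files
ratings = [(u, m, float((u + m) % 5 + 1)) for u in range(1, 6) for m in range(1, 5)]
train, test = holdout_split(ratings)

# Stand-in "model": predict the mean training rating for every held-out pair
mean_rating = sum(r for _, _, r in train) / len(train)
actual = {(u, m): r for u, m, r in test}
predicted = {(u, m): mean_rating for u, m, _ in test}
print("held-out MSE:", mse(actual, predicted))
```

Keying both dictionaries by `(user, movie)` mirrors the join in cell 3, where real and predicted ratings are matched on the same pair before the squared differences are averaged.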