## Movie Recommendation System
In this notebook, we will use an Alternating Least Squares (ALS) algorithm with Spark APIs to predict the ratings for the movies in [MovieLens Full dataset](https://grouplens.org/datasets/movielens/latest/)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Spark Setup

In [2]:
# Spark setup
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://apache.osuosl.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xf spark-3.1.2-bin-hadoop3.2.tgz

0% [Working]            Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
0% [Connecting to archive.ubuntu.com (91.189.88.152)] [Connecting to security.u0% [Connecting to archive.ubuntu.com (91.189.88.152)] [Connecting to security.u0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (91.189.88.152)                                                                               Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (91.189.88.152)                                                                               Ign:3 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (91.189.88.152)                                                                               Hit:4 https://developer.download.nvidia.com/comp

In [3]:
# Set up Spark
!pip install -q findspark
!pip install py4j

!export JAVA_HOME=$(/usr/lib/jvm/java-8-openjdk-amd64 -v 1.8)
! echo $JAVA_HOME
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"
os.environ["PYSPARK_PYTHON"] = "python3"
import findspark
findspark.init("spark-3.1.2-bin-hadoop3.2")# SPARK_HOME

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Collecting py4j
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
[?25l[K     |█▋                              | 10 kB 26.1 MB/s eta 0:00:01[K     |███▎                            | 20 kB 26.2 MB/s eta 0:00:01[K     |█████                           | 30 kB 18.4 MB/s eta 0:00:01[K     |██████▋                         | 40 kB 16.9 MB/s eta 0:00:01[K     |████████▎                       | 51 kB 8.3 MB/s eta 0:00:01[K     |██████████                      | 61 kB 8.1 MB/s eta 0:00:01[K     |███████████▌                    | 71 kB 8.3 MB/s eta 0:00:01[K     |█████████████▏                  | 81 kB 9.3 MB/s eta 0:00:01[K     |██████████████▉                 | 92 kB 9.8 MB/s eta 0:00:01[K     |████████████████▌               | 102 kB 7.8 MB/s eta 0:00:01[K     |██████████████████▏             | 112 kB 7.8 MB/s eta 0:00:01[K     |███████████████████▉            | 122 kB 7.8 MB/s eta 0:00:01[K     |█████████████████████▍          | 133 kB 7.8 MB/s eta 0:00:01

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math

## Part 1: Data ETL and Data Exploration

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession \
  .builder \
  .appName("moive analysis") \
  .config("spark.some.config.option", "some-value") \
  .getOrCreate()

In [6]:
movies_df = spark.read.load("/content/drive/MyDrive/Documents/Projects/Movie_Recommendation_System/ml-latest/movies.csv", format='csv', header = True)
ratings_df = spark.read.load("/content/drive/MyDrive/Documents/Projects/Movie_Recommendation_System/ml-latest/ratings.csv", format='csv', header = True)
links_df = spark.read.load("/content/drive/MyDrive/Documents/Projects/Movie_Recommendation_System/ml-latest/links.csv", format='csv', header = True)
tags_df = spark.read.load("/content/drive/MyDrive/Documents/Projects/Movie_Recommendation_System/ml-latest/tags.csv", format='csv', header = True)

In [7]:
movies_df.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [8]:
ratings_df.show(5)

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    307|   3.5|1256677221|
|     1|    481|   3.5|1256677456|
|     1|   1091|   1.5|1256677471|
|     1|   1257|   4.5|1256677460|
|     1|   1449|   4.5|1256677264|
+------+-------+------+----------+
only showing top 5 rows



In [9]:
links_df.show(5)

+-------+-------+------+
|movieId| imdbId|tmdbId|
+-------+-------+------+
|      1|0114709|   862|
|      2|0113497|  8844|
|      3|0113228| 15602|
|      4|0114885| 31357|
|      5|0113041| 11862|
+-------+-------+------+
only showing top 5 rows



In [10]:
tags_df.show(5)

+------+-------+------------+----------+
|userId|movieId|         tag| timestamp|
+------+-------+------------+----------+
|    14|    110|        epic|1443148538|
|    14|    110|    Medieval|1443148532|
|    14|    260|      sci-fi|1442169410|
|    14|    260|space action|1442169421|
|    14|    318|imdb top 250|1442615195|
+------+-------+------------+----------+
only showing top 5 rows



In [11]:
tmp1 = ratings_df.groupBy("userID").count().toPandas()['count'].min()
tmp2 = ratings_df.groupBy("movieId").count().toPandas()['count'].min()
print('For the users that rated movies and the movies that were rated:')
print('Minimum number of ratings per user is {}'.format(tmp1))
print('Minimum number of ratings per movie is {}'.format(tmp2))

For the users that rated movies and the movies that were rated:
Minimum number of ratings per user is 1
Minimum number of ratings per movie is 1


In [12]:
tmp1 = sum(ratings_df.groupBy("movieId").count().toPandas()['count'] == 1)
tmp2 = ratings_df.select('movieId').distinct().count()
print('{} out of {} movies are rated by only one user'.format(tmp1, tmp2))

10155 out of 53889 movies are rated by only one user


## Part 2: Spark SQL and OLAP EDA

In [13]:
movies_df.registerTempTable("movies")
ratings_df.registerTempTable("ratings")
links_df.registerTempTable("links")
tags_df.registerTempTable("tags")

### Number of Users

In [14]:
num_users = spark.sql("SELECT COUNT(DISTINCT userId) as num_users FROM ratings")
num_users.show()

+---------+
|num_users|
+---------+
|   283228|
+---------+



### Number of Movies

In [15]:
num_movies = spark.sql("SELECT COUNT(DISTINCT movieId) as num_movies FROM movies")
num_movies.show()

+----------+
|num_movies|
+----------+
|     58098|
+----------+



### Number of movies that are rated by users

In [16]:
rated_movies = spark.sql("SELECT COUNT(DISTINCT movieId) as rated_movies FROM ratings")
rated_movies.show()

+------------+
|rated_movies|
+------------+
|       53889|
+------------+



### Movie Genres

In [19]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
genresSplit = udf(lambda x: x.split('|'), ArrayType(StringType()))
spark.udf.register("genresSplit", genresSplit)
movie_genres = spark.sql("SELECT DISTINCT EXPLODE(genresSplit(genres)) as genres FROM movies ORDER BY 1")
movie_genres.show()

+------------------+
|            genres|
+------------------+
|(no genres listed)|
|            Action|
|         Adventure|
|         Animation|
|          Children|
|            Comedy|
|             Crime|
|       Documentary|
|             Drama|
|           Fantasy|
|         Film-Noir|
|            Horror|
|              IMAX|
|           Musical|
|           Mystery|
|           Romance|
|            Sci-Fi|
|          Thriller|
|               War|
|           Western|
+------------------+



### Movie count by genres

In [20]:
category_count = spark.sql("SELECT genres, COUNT(movieId) as count FROM (SELECT EXPLODE(genresSplit(genres)) as genres, movieId FROM movies) GROUP BY 1 ORDER BY 2 DESC")
category_count.show()

+------------------+-----+
|            genres|count|
+------------------+-----+
|             Drama|24144|
|            Comedy|15956|
|          Thriller| 8216|
|           Romance| 7412|
|            Action| 7130|
|            Horror| 5555|
|       Documentary| 5118|
|             Crime| 5105|
|(no genres listed)| 4266|
|         Adventure| 4067|
|            Sci-Fi| 3444|
|           Mystery| 2773|
|          Children| 2749|
|         Animation| 2663|
|           Fantasy| 2637|
|               War| 1820|
|           Western| 1378|
|           Musical| 1113|
|         Film-Noir|  364|
|              IMAX|  197|
+------------------+-----+



### Movies by genres

In [21]:
category_movies = spark.sql("SELECT genres, concat_ws(',', collect_list(title)) as movies FROM (SELECT EXPLODE(genresSplit(genres)) as genres, title FROM movies) GROUP BY genres")
category_movies.show()

+------------------+--------------------+
|            genres|              movies|
+------------------+--------------------+
|             Crime|Heat (1995),Casin...|
|           Romance|Grumpier Old Men ...|
|          Thriller|Heat (1995),Golde...|
|         Adventure|Toy Story (1995),...|
|             Drama|Waiting to Exhale...|
|               War|Richard III (1995...|
|       Documentary|Across the Sea of...|
|           Fantasy|Toy Story (1995),...|
|           Mystery|Copycat (1995),Ci...|
|           Musical|Pocahontas (1995)...|
|         Animation|Toy Story (1995),...|
|         Film-Noir|Devil in a Blue D...|
|(no genres listed)|Away with Words (...|
|              IMAX|Wings of Courage ...|
|            Horror|Dracula: Dead and...|
|           Western|Desperado (1995),...|
|            Comedy|Toy Story (1995),...|
|          Children|Toy Story (1995),...|
|            Action|Heat (1995),Sudde...|
|            Sci-Fi|Powder (1995),Cit...|
+------------------+--------------

## Part 3: Spark ALS based Recommendation System
We will use an Spark ML to predict the ratings, so let's reload "ratings.csv" and convert it to the form of (user, item, rating) tuples.

In [22]:
ratings_df.show()

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    307|   3.5|1256677221|
|     1|    481|   3.5|1256677456|
|     1|   1091|   1.5|1256677471|
|     1|   1257|   4.5|1256677460|
|     1|   1449|   4.5|1256677264|
|     1|   1590|   2.5|1256677236|
|     1|   1591|   1.5|1256677475|
|     1|   2134|   4.5|1256677464|
|     1|   2478|   4.0|1256677239|
|     1|   2840|   3.0|1256677500|
|     1|   2986|   2.5|1256677496|
|     1|   3020|   4.0|1256677260|
|     1|   3424|   4.5|1256677444|
|     1|   3698|   3.5|1256677243|
|     1|   3826|   2.0|1256677210|
|     1|   3893|   3.5|1256677486|
|     2|    170|   3.5|1192913581|
|     2|    849|   3.5|1192913537|
|     2|   1186|   3.5|1192913611|
|     2|   1235|   3.0|1192913585|
+------+-------+------+----------+
only showing top 20 rows



In [23]:
movie_ratings=ratings_df.drop('timestamp')

In [24]:
# Data type convert
from pyspark.sql.types import IntegerType, FloatType
movie_ratings = movie_ratings.withColumn("userId", movie_ratings["userId"].cast(IntegerType()))
movie_ratings = movie_ratings.withColumn("movieId", movie_ratings["movieId"].cast(IntegerType()))
movie_ratings = movie_ratings.withColumn("rating", movie_ratings["rating"].cast(FloatType()))

In [25]:
movie_ratings.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|    307|   3.5|
|     1|    481|   3.5|
|     1|   1091|   1.5|
|     1|   1257|   4.5|
|     1|   1449|   4.5|
|     1|   1590|   2.5|
|     1|   1591|   1.5|
|     1|   2134|   4.5|
|     1|   2478|   4.0|
|     1|   2840|   3.0|
|     1|   2986|   2.5|
|     1|   3020|   4.0|
|     1|   3424|   4.5|
|     1|   3698|   3.5|
|     1|   3826|   2.0|
|     1|   3893|   3.5|
|     2|    170|   3.5|
|     2|    849|   3.5|
|     2|   1186|   3.5|
|     2|   1235|   3.0|
+------+-------+------+
only showing top 20 rows



### ALS Model Selection and Evaluation

With the ALS model, we can use a grid search to find the optimal hyperparameters.

In [26]:
# import package
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator,ParamGridBuilder

In [27]:
#Create test and train set
(training,test)=movie_ratings.randomSplit([0.8,0.2])

In [28]:
#Create ALS model
als = ALS(maxIter=5, rank=10, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")

In [29]:
#Tune model using ParamGridBuilder
param_grid = ParamGridBuilder()\
  .addGrid(als.maxIter, [3, 5, 10, 15])\
  .addGrid(als.rank, [5, 10, 15, 20])\
  .addGrid(als.regParam, [2, 1, 0.5, 0.1])\
  .build()

In [30]:
# Define evaluator as RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

In [31]:
# Build Cross validation 
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

In [32]:
#Fit ALS model to training data
cv_model = cv.fit(training)

In [33]:
#Extract best model from the tuning exercise using ParamGridBuilder
best_model = cv_model.bestModel

### Model testing
And finally, make a prediction and check the testing error.

In [34]:
#Generate predictions and evaluate using RMSE
predictions = best_model.transform(test)
rmse = evaluator.evaluate(predictions)

In [35]:
#Print evaluation metrics and model parameters
best_params = cv_model.getEstimatorParamMaps()[np.argmin(cv_model.avgMetrics)]
print("RMSE = "+str(rmse))
print("**Best Model Parameters**")

for i, j in best_params.items():
  print(" " + i.name + ":", j)

RMSE = 0.8110907787176517
**Best Model Parameters**
 maxIter: 15
 rank: 20
 regParam: 0.1


In [36]:
predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|107339|    148|   4.0| 3.3707488|
|253535|    148|   4.0|  2.981288|
| 50155|    148|   3.0|  3.053047|
|207939|    148|   3.0| 2.5791159|
|220572|    148|   2.0| 2.8528523|
|244192|    148|   3.0| 2.5753422|
|102642|    148|   4.0| 3.2726376|
| 52772|    148|   3.0|  3.342264|
|209436|    148|   2.0| 1.8003827|
|175332|    148|   4.0| 2.9685838|
| 74196|    148|   2.0| 3.0998247|
|269499|    148|   5.0| 3.2116663|
|243931|    148|   4.0| 2.8293238|
| 68789|    148|   2.0| 2.4070406|
| 87619|    148|   3.0| 2.8859024|
|276648|    148|   4.0|  2.930108|
| 17570|    148|   5.0| 2.2642663|
| 49815|    148|   3.5| 2.9580467|
|107574|    148|   1.0| 3.1151443|
| 92452|    148|   2.0| 2.3767016|
+------+-------+------+----------+
only showing top 20 rows



### Model Performance

In [37]:
alldata = best_model.transform(movie_ratings)
rmse = evaluator.evaluate(alldata)
print ("RMSE = "+str(rmse))

RMSE = 0.7671422814601507


In [38]:
alldata.registerTempTable("alldata")

In [39]:
spark.sql("SELECT* FROM alldata").show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|107339|    148|   4.0| 3.3707488|
| 93112|    148|   3.0| 2.8507807|
|106148|    148|   2.5| 2.8564768|
|234926|    148|   4.0| 2.8931537|
|253535|    148|   4.0|  2.981288|
| 50155|    148|   3.0|  3.053047|
| 65991|    148|   4.0| 2.9118414|
|146376|    148|   5.0| 3.6848505|
|207939|    148|   3.0| 2.5791159|
| 41788|    148|   3.0| 2.8617446|
|220572|    148|   2.0| 2.8528523|
|244192|    148|   3.0| 2.5753422|
|273242|    148|   4.0| 3.3036916|
| 52620|    148|   1.0| 2.7925816|
| 98426|    148|   3.0| 2.5198703|
|102642|    148|   4.0| 3.2726376|
|108082|    148|   3.0| 2.9049718|
|264081|    148|   3.0|  2.865911|
| 60382|    148|   4.0|  3.418328|
|275860|    148|   3.0| 2.7914755|
+------+-------+------+----------+
only showing top 20 rows



In [40]:
spark.sql("SELECT* FROM movies JOIN alldata ON movies.movieId = alldata.movieId").show()

+-------+--------------------+------+------+-------+------+----------+
|movieId|               title|genres|userId|movieId|rating|prediction|
+-------+--------------------+------+------+-------+------+----------+
|    148|Awfully Big Adven...| Drama|107339|    148|   4.0| 3.3707488|
|    148|Awfully Big Adven...| Drama| 93112|    148|   3.0| 2.8507807|
|    148|Awfully Big Adven...| Drama|106148|    148|   2.5| 2.8564768|
|    148|Awfully Big Adven...| Drama|234926|    148|   4.0| 2.8931537|
|    148|Awfully Big Adven...| Drama|253535|    148|   4.0|  2.981288|
|    148|Awfully Big Adven...| Drama| 50155|    148|   3.0|  3.053047|
|    148|Awfully Big Adven...| Drama| 65991|    148|   4.0| 2.9118414|
|    148|Awfully Big Adven...| Drama|146376|    148|   5.0| 3.6848505|
|    148|Awfully Big Adven...| Drama|207939|    148|   3.0| 2.5791159|
|    148|Awfully Big Adven...| Drama| 41788|    148|   3.0| 2.8617446|
|    148|Awfully Big Adven...| Drama|220572|    148|   2.0| 2.8528523|
|    1

## Part 4: Model Application

### Recommend movies to users with id: 575, 232

In [41]:
!pip install koalas
import databricks.koalas as ks

Collecting koalas
  Downloading koalas-1.8.1-py3-none-any.whl (1.4 MB)
[?25l[K     |▎                               | 10 kB 25.7 MB/s eta 0:00:01[K     |▌                               | 20 kB 16.4 MB/s eta 0:00:01[K     |▊                               | 30 kB 15.6 MB/s eta 0:00:01[K     |█                               | 40 kB 13.6 MB/s eta 0:00:01[K     |█▏                              | 51 kB 6.6 MB/s eta 0:00:01[K     |█▍                              | 61 kB 7.6 MB/s eta 0:00:01[K     |█▊                              | 71 kB 8.2 MB/s eta 0:00:01[K     |██                              | 81 kB 7.9 MB/s eta 0:00:01[K     |██▏                             | 92 kB 6.5 MB/s eta 0:00:01[K     |██▍                             | 102 kB 7.2 MB/s eta 0:00:01[K     |██▋                             | 112 kB 7.2 MB/s eta 0:00:01[K     |██▉                             | 122 kB 7.2 MB/s eta 0:00:01[K     |███▏                            | 133 kB 7.2 MB/s eta 0:00:01[K 



In [42]:
# Convert movies_df using koalas
movies_ks = movies_df.to_koalas()

In [43]:
# function to recommend 10 movies to a given user

def topKRecommendation(k, id, model):
  '''
  k: number of recommendations
  id: user id
  model: the trained model
  '''
  # top 10 recommendations for all users
  all_recs = best_model.recommendForAllUsers(k)
  all_recs_ks = all_recs.to_koalas()
  movies_ks = movies_df.to_koalas()
  user_recs = all_recs_ks.loc[id, 'recommendations']
  recs = []
  for i in user_recs:
    recs.append(i[0])
  return movies_ks[movies_ks['movieId'].isin(recs)]

In [44]:
topKRecommendation(10, 575, best_model)

Unnamed: 0,movieId,title,genres
22267,106115,"Story of Science, The (2010)",Documentary
25620,117352,A Kind of America 2 (2008),Comedy
36262,144202,Catch That Girl (2002),Action|Children
45993,166812,Seeing Red: Stories of American Communists (1983),(no genres listed)
50847,177209,Acı Aşk (2009),Drama
53874,183947,NOFX Backstage Passport 2,(no genres listed)
54452,185203,Finding Joe (2011),Documentary
54470,185241,The Sting (1992),Comedy
55344,187125,Head Above Water (1993),Comedy|Thriller
56883,190707,1968 (2018),(no genres listed)


In [45]:
topKRecommendation(10, 232, best_model)

Unnamed: 0,movieId,title,genres
22267,106115,"Story of Science, The (2010)",Documentary
25620,117352,A Kind of America 2 (2008),Comedy
36262,144202,Catch That Girl (2002),Action|Children
45993,166812,Seeing Red: Stories of American Communists (1983),(no genres listed)
50847,177209,Acı Aşk (2009),Drama
53874,183947,NOFX Backstage Passport 2,(no genres listed)
54790,185959,Wajib (2017),Drama
55732,187947,Finger of God (2007),Documentary
55733,187949,Furious Love (2010),Documentary
56883,190707,1968 (2018),(no genres listed)


### Find the similar movies for movie with id: 463, 471


In [46]:
item_factors=best_model.itemFactors.to_koalas()

In [47]:
def similarMovies(movieId, matrix = 'cosine_similarity'):
  '''
  id: movie id
  matrix: distance calcluation method
  '''
  try:
    movie_factors = item_factors.loc[item_factors.id==str(movieId),'features'].to_numpy()[0]
  except:
    return "There is no movie with the given id."
  if matrix == "cosine_similarity":
    similar_movies = pd.DataFrame(columns = ('movieId', 'cosine_similarity'))
    for id, factors in item_factors.to_numpy():
      cos_sim = np.dot(movie_factors, factors)/(np.linalg.norm(movie_factors)*np.linalg.norm(factors))
      similar_movies = similar_movies.append({'movieId': str(id), 'cosine_similarity': cos_sim}, ignore_index = True)
    cos_sim_movie = similar_movies.sort_values(by=['cosine_similarity'],ascending = False)[1:11]
    output = cos_sim_movie.merge(movies_ks.to_pandas(), left_on='movieId', right_on = 'movieId', how = 'inner')
  if matrix=='euclidean_distance':
    similar_movies = pd.DataFrame(columns=('movieId','euclidean_distance'))
    for id, factors in item_factors.to_numpy():
      euc_dis = np.linalg.norm(np.array(movie_factors)-np.array(factors))
      similar_movies = similar_movies.append({'movieId': str(id), 'euclidean_distance': euc_dis}, ignore_index=True)
    euc_dis_movie = similar_movies.sort_values(by=['euclidean_distance'])[1:11]
    output = euc_dis_movie.merge(movies_ks.to_pandas(), left_on='movieId', right_on = 'movieId', how = 'inner')
  return output[['movieId','title','genres']]

In [48]:
similarMovies(463)

Unnamed: 0,movieId,title,genres
0,554,Trial by Jury (1994),Crime|Drama|Thriller
1,4885,Domestic Disturbance (2001),Thriller
2,2741,No Mercy (1986),Action|Crime|Thriller
3,117330,The Mark of the Angels - Miserere (2013),Thriller
4,1422,Murder at 1600 (1997),Crime|Drama|Mystery|Thriller
5,225,Disclosure (1994),Drama|Thriller
6,8720,"Super, The (1991)",Comedy
7,544,Striking Distance (1993),Action|Crime
8,1661,Switchback (1997),Crime|Mystery|Thriller
9,360,I Love Trouble (1994),Action|Comedy


In [49]:
similarMovies(471, 'cosine_similarity')

Unnamed: 0,movieId,title,genres
0,1243,Rosencrantz and Guildenstern Are Dead (1990),Comedy|Drama
1,4467,"Adventures of Baron Munchausen, The (1988)",Adventure|Comedy|Fantasy
2,6062,Lost in La Mancha (2002),Documentary
3,4027,"O Brother, Where Art Thou? (2000)",Adventure|Comedy|Crime
4,44788,This Film Is Not Yet Rated (2006),Documentary
5,4036,Shadow of the Vampire (2000),Drama|Horror
6,3108,"Fisher King, The (1991)",Comedy|Drama|Fantasy|Romance
7,59141,Son of Rambow (2007),Children|Comedy|Drama
8,151695,The Survivalist (2015),Drama|Sci-Fi|Thriller
9,6296,"Mighty Wind, A (2003)",Comedy|Musical


In [50]:
similarMovies(471, 'euclidean_distance')

Unnamed: 0,movieId,title,genres
0,4467,"Adventures of Baron Munchausen, The (1988)",Adventure|Comedy|Fantasy
1,6062,Lost in La Mancha (2002),Documentary
2,4027,"O Brother, Where Art Thou? (2000)",Adventure|Comedy|Crime
3,1243,Rosencrantz and Guildenstern Are Dead (1990),Comedy|Drama
4,44788,This Film Is Not Yet Rated (2006),Documentary
5,3108,"Fisher King, The (1991)",Comedy|Drama|Fantasy|Romance
6,59141,Son of Rambow (2007),Children|Comedy|Drama
7,4036,Shadow of the Vampire (2000),Drama|Horror
8,6296,"Mighty Wind, A (2003)",Comedy|Musical
9,151695,The Survivalist (2015),Drama|Sci-Fi|Thriller
