# Social Computing - Summer 2018
# Exercise 3: Collaborative Filtering Recommender System

Write a simple collaborative filtering movie recommender in Python. The input (attached to this exercise) is a subset of the MovieLens dataset (movielens.org) containing 1862 movies and 100K ratings. You can find the dataset on Piazza. The input is provided in two data files. The first, u.item, is a listing of movies: each row represents one movie, with its attributes separated by ‘|’. We are only interested in the first two attributes, the movie ID and the movie name. The second, u.data, contains the users' movie ratings: each row represents one user’s rating for one movie, and its attributes are (from left to right) user ID, movie ID, rating, and timestamp.
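
To make the two row layouts concrete, the following sketch parses one row of each kind. The sample rows are made up for illustration and are not taken from the dataset, and the tab separator for u.data is an assumption.

```python
# Hypothetical sample rows; the real files contain many more rows and,
# in u.item, more attributes per row.
sample_item_row = "1|Example Movie (1995)|01-Jan-1995|extra|attributes"
sample_data_row = "15\t1\t4\t881250949"  # assumes tab-separated fields

# u.item: keep only the first two attributes (movie ID, movie name)
item_fields = sample_item_row.strip().split("|")
movie_id, movie_name = int(item_fields[0]), item_fields[1]

# u.data: user ID, movie ID, rating, timestamp (from left to right)
user_id, rated_movie_id, rating, timestamp = (
    int(field) for field in sample_data_row.strip().split("\t")
)

print((movie_id, movie_name))
print((user_id, rated_movie_id, rating, timestamp))
```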

<b>Exercise</b><br>
The entry point to your recommendation engine should be a Python function called “recommend” that takes a user ID and the paths to the two aforementioned data files. The function should return the top twenty recommended movies for that user <u><b>(the movies must not have been rated by that user before)</b></u>. The output should be a list of Python tuples, sorted by the movies' expected ratings, highest first. Each tuple has two elements: movie name and expected rating. You are free to design your recommendation engine the way you want, but straightforward collaborative filtering is highly recommended. Make sure that the code is clean, readable, and well documented. Test the code for user 15!

### The Collaborative Filtering Recommender
The idea of collaborative filtering is to find users who are similar to the target user. The items rated highly by those users are likely to be liked by the target user as well. Therefore, the main problem is to find the users similar to the target user. There are several ways of measuring similarity. We could, e.g., use cosine similarity, but one of the simplest measures is Euclidean distance. Your program should calculate the Euclidean distance between the target user and every other user in the dataset, turn it into a similarity score (the smaller the distance, the higher the similarity), and then calculate the expected rating of each movie for the target user based on the formula:

$$r_{ui} = \frac{\sum_{v \in N_i(u)} w_{uv}r_{vi}}{\sum_{v \in N_i(u)} w_{uv}}$$

Here $r_{ui}$ is the expected rating of item $i$ for target user $u$, $N_i(u)$ is the set of users similar to target user $u$ who have rated item $i$, and $w_{uv}$ is the similarity score between users $u$ and $v$. It serves as the weighting factor for $r_{vi}$, the rating of user $v$ for item $i$.
The curious student should refer to Ricci et al. (eds.), "Recommender Systems Handbook", Springer 2011, for more details.
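
As a small worked example of the formula, the sketch below predicts one rating on a toy dataset. The ratings are made up for illustration, the helper names `similarity` and `predict` are hypothetical, and the mapping $w_{uv} = 1 / (1 + d_{uv})$ from Euclidean distance to similarity is one possible choice (it is the one used in the code cell of this notebook), not the only one.

```python
import math

# Toy ratings, made up for illustration: {user_id: {movie_id: rating}}
toy_ratings = {
    1: {10: 5, 11: 4},          # target user u
    2: {10: 4, 11: 4, 12: 5},   # close to u
    3: {10: 5, 11: 1, 12: 2},   # further from u
}


def similarity(ratings, u, v):
    """Map the Euclidean distance over commonly rated movies to (0, 1]."""
    common = [m for m in ratings[u] if m in ratings[v]]
    if not common:
        return 0.0
    distance = math.sqrt(sum((ratings[u][m] - ratings[v][m]) ** 2 for m in common))
    return 1.0 / (1.0 + distance)


def predict(ratings, u, i):
    """Similarity-weighted average of the ratings other users gave item i."""
    neighbours = [v for v in ratings if v != u and i in ratings[v]]
    weights = {v: similarity(ratings, u, v) for v in neighbours}
    total_weight = sum(weights.values())
    if total_weight == 0:
        return None  # no similar user rated item i
    return sum(weights[v] * ratings[v][i] for v in neighbours) / total_weight


w_12 = similarity(toy_ratings, 1, 2)   # distance 1 -> 1 / (1 + 1) = 0.5
w_13 = similarity(toy_ratings, 1, 3)   # distance 3 -> 1 / (1 + 3) = 0.25
r_1_12 = predict(toy_ratings, 1, 12)   # (0.5 * 5 + 0.25 * 2) / 0.75 = 4.0
print(w_12, w_13, r_1_12)
```

User 2 agrees with the target user more closely than user 3 does, so user 2's rating of 5 pulls the prediction for movie 12 further than user 3's rating of 2.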

In [1]:
import math

# Create a dictionary: {movie_id: movie_name}
# Parameters:
#   path_u_item: path to u.item as a string
def create_movie_dictionary(path_u_item):
    movie_dictionary = {}
    with open(path_u_item) as movie_file:  # the file is closed automatically
        for line in movie_file:
            row = line.strip().split("|")
            # Only the first two attributes (ID and name) are needed
            movie_id, movie_name = int(row[0]), row[1]
            movie_dictionary[movie_id] = movie_name
    return movie_dictionary


# Create a dictionary: {user_id: {movie_id: user_rating}}
def create_user_rating_dictionary(path_input):
    user_rating_dictionary = {}
    with open(path_input) as rating_file:  # the file is closed automatically
        for line in rating_file:
            row = line.strip().split('\t')
            # The fourth attribute (timestamp) is not needed
            user_id, movie_id, rating = int(row[0]), int(row[1]), int(row[2])
            # setdefault creates the inner dictionary on a user's first rating,
            # so no KeyError can occur here
            user_rating_dictionary.setdefault(user_id, {})[movie_id] = rating
    return user_rating_dictionary


# Using Euclidean distance to calculate similarity score
def calculate_similarity_score(ratings, user_id1, user_id2):
    # Movies rated by both users (dictionary lookups avoid scanning a list)
    common_movies = [movie for movie in ratings[user_id1] if movie in ratings[user_id2]]
            
    if len(common_movies) == 0: # no common ratings between two users. Similarity is 0
        return 0

    # Accumulate the sum of squares of differences in ratings between the two users for the same movie
    sum_of_squares_of_differences = 0
    for movie_id in common_movies:
        diff = ratings[user_id1][movie_id] - ratings[user_id2][movie_id]
        sum_of_squares_of_differences += diff * diff

    # The Euclidean distance is the square root of that sum; map it to a
    # similarity score in (0, 1], where identical ratings give 1
    return 1 / (1 + math.sqrt(sum_of_squares_of_differences))


def cf_recommend(ratings, target_user_id):
    weighted_ratings = {} # {movie_id: sum of similarity-weighted ratings}
    similarity_scores = {} # {movie_id: sum of similarity scores}
    recommended_movie_list = [] # Each element is a tuple (estimated_rating, movie_id)
    for user_id in ratings:
        if user_id != target_user_id:
            similarity_score = calculate_similarity_score(ratings, target_user_id, user_id)
            if similarity_score > 0:
                for movie_id in ratings[user_id]:
                    # Movie was not rated by the target user before
                    if movie_id not in ratings[target_user_id]:
                        # Accumulate the weighted rating for that movie:
                        # user_id's rating of the movie * similarity score
                        # between that user and the target user
                        weighted_ratings[movie_id] = (weighted_ratings.get(movie_id, 0)
                                                      + ratings[user_id][movie_id] * similarity_score)
                        # Accumulate the similarity scores of all users who rated that movie
                        similarity_scores[movie_id] = similarity_scores.get(movie_id, 0) + similarity_score

    for movie in weighted_ratings:
        # Estimated rating = accumulated weighted rating / sum of the
        # similarity scores of the users who rated that movie
        estimated_rating = weighted_ratings[movie] / similarity_scores[movie]
        recommended_movie_list.append((estimated_rating, movie))

    # Sort by estimated rating, highest first
    recommended_movie_list.sort(reverse=True)
    return recommended_movie_list # List of recommended movies for target_user_id from highest to lowest estimated rating


# Testing implementation
movie_dict = create_movie_dictionary("u.item")
ratings = create_user_rating_dictionary("u.data")
recommended_movies = cf_recommend(ratings, 15)  # target user_id (e.g. 15)
top_twenty = []
for estimated_rating, movie_id in recommended_movies[:20]:
    top_twenty.append((movie_dict[movie_id], estimated_rating))
print(top_twenty)

[('Entertaining Angels: The Dorothy Day Story (1996)', 5.0), ("Someone Else's America (1995)", 5.0), ('Aiqing wansui (1994)', 5.0), ('Santa with Muscles (1996)', 5.0), ('Saint of Fort Washington, The (1993)', 5.0), ('Star Kid (1997)', 5.0), ('Marlene Dietrich: Shadow and Light (1996) ', 5.0), ('Prefontaine (1997)', 5.0), ('They Made Me a Criminal (1939)', 5.0), ('Great Day in Harlem, A (1994)', 5.0), ('Pather Panchali (1955)', 4.707643292903408), ('Letter From Death Row, A (1998)', 4.669158388353758), ('Bitter Sugar (Azucar Amargo) (1996)', 4.596712913497055), ("Some Mother's Son (1996)", 4.578057497967401), ('Close Shave, A (1995)', 4.532323099469121), ('Anna (1996)', 4.516688444604956), ('Maya Lin: A Strong Clear Vision (1994)', 4.492758300698357), ('Everest (1998)', 4.449386037323544), ('Faust (1994)', 4.432941667793023), ("Schindler's List (1993)", 4.431900305815376)]
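
The exercise asks for a single entry point called “recommend” that takes a user ID and the two file paths. One way to package the steps into such a function is sketched below. It assumes tab-separated fields in u.data and the same 1 / (1 + distance) similarity as above; the `top_n` parameter is a hypothetical convenience that is not part of the exercise text. If a data file is not in the platform's default text encoding, an explicit `encoding` argument to `open` may be needed.

```python
import math


def recommend(user_id, path_u_item, path_u_data, top_n=20):
    """Return up to top_n (movie name, expected rating) tuples, highest first."""
    # {movie_id: movie_name}, from the first two '|'-separated attributes
    movie_names = {}
    with open(path_u_item) as item_file:
        for line in item_file:
            fields = line.strip().split("|")
            movie_names[int(fields[0])] = fields[1]

    # {user_id: {movie_id: rating}}; the timestamp column is ignored
    ratings = {}
    with open(path_u_data) as data_file:
        for line in data_file:
            fields = line.strip().split("\t")
            ratings.setdefault(int(fields[0]), {})[int(fields[1])] = int(fields[2])

    target = ratings.get(user_id, {})
    weighted_sums = {}  # {movie_id: sum of similarity * rating}
    weight_totals = {}  # {movie_id: sum of similarity}
    for other_id, other in ratings.items():
        if other_id == user_id:
            continue
        common = [m for m in target if m in other]
        if not common:
            continue  # no shared ratings, so the similarity is 0
        distance = math.sqrt(sum((target[m] - other[m]) ** 2 for m in common))
        weight = 1.0 / (1.0 + distance)
        for movie_id, rating in other.items():
            if movie_id not in target:  # skip movies the user already rated
                weighted_sums[movie_id] = weighted_sums.get(movie_id, 0.0) + weight * rating
                weight_totals[movie_id] = weight_totals.get(movie_id, 0.0) + weight

    # Weighted average per movie, sorted from highest to lowest estimate
    estimates = [(weighted_sums[m] / weight_totals[m], m) for m in weighted_sums]
    estimates.sort(reverse=True)
    return [(movie_names[m], estimate) for estimate, m in estimates[:top_n]]
```

With the data files in the working directory, `recommend(15, "u.item", "u.data")` would return the list for user 15.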
