Skip to content

bnak/Recommendations_Engine

Repository files navigation

Recommendations Engine

This is an implementation of a neighbor-based matrix decomposition recommendations engine using Yelp's mrjob package to run the program on Amazon's Elastic MapReduce.

Quickstart

Install Python packages Yelp's mrjob package and NumPy.

build_matrix.py will read in a directory, build the initial ratings matrix, and create predicted ratings that can be output to the terminal.

mr_job_recommendations.py will load:

dictionary of movie ratings - (Key: (movie_id, user_id), Value: rating)

dictionary of movies and users who have rated each movie, used to build neighborhoods - (Key: movie_id, Value: list of users who have rated that movie)

The map function places movies into neighborhoods and the reduce function produced predicted ratings through matrix decomposition. Yelp's mrjob package has great documentation and can be configured to Hadoop and Amazon EMR jobs.

Within the Python files, you can also set the following inputs to affect the accuracy of the algorithm, and the impact of these parameters are discussed in more detail below:

    K     : the number of latent features
    iterations : the maximum number of iterations toward optimization
    alpha : the learning rate (rate at which vectors in P and Q are incremented toward existing ratings)
    beta  : the regularization parameter (prevents overfitting) 

File Manifest

build_matrix.py - functions that read in data files, produce neighborhoods, performs matrix decomposition, and creates predictive ratings matrix.

mr_job_recommendations.py - program that runs the matrix decomposition as a MapReduce job using Yelp's MRjob package

build_test_data.py - program that builds files in the same format as the Netflix data to produce practice datasets

netflix_local_data - Reduced datasets in the same format as the data provided by Netflix. The first line of every file is the movie_id, and each line thereafter is in the format "user_id,rating,date"

movie_ratings_dictionary_NLD, movies_and_users_who_rated_NLD - Dictionaries from the local data that cna be loaded in the mr_job_recommendations.py file to run the MapReduce job locally

The Netflix Prize

In 2006, Netflix announced an open competition for the best collaborative filtering algorithm to predict user ratings for films based solely on previous ratings. Netflix provided a dataset of 480,189 users, 17,770 movies, and 100,480,507 ratings.

How does one calculate predicted ratings for each (movie, user) pair from a very sparse seed dataset?

The Solution

Neighbor-based Matrix Decomposition

alt text

alt text

Two factor matrices are generated and the dot-product of the vectors from the factor matrices are checked against the existing ratings and tweaked until the error between the dot-products is very small. More information the Incremental Matrix Factorization technique can be found here

alt text

Once the factor matrices are built, the product results in the predicted ratings matrix.

MapReduce

Because of the size of the dataset, one can distribute computing on a Hadoop cluster. I divided the dataset into neighborhoods and used Yelp's mrjob package and Amazon's Elastic MapReduce.

alt text

Movies were divided into neighborhoods by the number of common users who had rated those movies in the mapping step. The reducing step performs the matrix decomposition on the neighborhood.

alt text

References

Sources

Matrix Factorization Techniques for Recommender Systems

PyCon 2011: mrjob: Distributed Computing for Everyone

Matrix Factorization: A Simple Tutorial and Implementation in Python

Using Linear Algebra for Intelligent Information Retrieval

Incremental Matrix Factorization for Collaborative Filtering

[Large-scale Parallel Collaborative Filtering for the Netflix Prize] (http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf)

Incremental Singular Value Decomposition Algorithms for Highly Scaleable Recommender Systems

Instructors and Mentors

[Dr. J] (http://www.amazon.com/Probing-supersymmetry-cosmogenic-neutrinos-Uscinski/dp/1244944467)

Meghan Hade

Christian's Musings

Cynthia Dueltgen

Nick Avgerinos

Anusha Rajan

Daniel Malmer

Gowri Grewal

About

Matrix Decomposition Solution for Netflix Prize

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages