Using Apache PySpark and Alternating Least Squares (ALS), we built a recommender system on the Million Song Dataset. Our paper can be found here.
The following files were run sequentially to obtain the final results from the ALS model (i.e., 500 recommendations per user):
- Build_Hash.py: creates a uniform integer hash key for the train, test, and validation sets, then saves the key to our local HDFS
- Parquet_Build.py: loads the uniform hash key from HDFS, applies it to each dataset, and writes the new Parquet files back out to our local HDFS
- GridSearch_All.py: performs a grid search over the ALS model's hyperparameters
- GridSearchFinal: folder containing our grid-search results and the corresponding Jupyter notebook
- FinalModel.py: runs our final model with the optimal hyperparameters (at a high number of iterations)
- Subsample.py: subsamples 0.5% of the train and test user/track/count data
- Lenskit_Extension.ipynb: extension results
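The uniform hash key exists because Spark's ALS requires integer user and item IDs, while the raw dataset uses string IDs; the same mapping must be applied to every split so an ID hashes identically everywhere. A minimal plain-Python sketch of that idea (the function name and data shapes are illustrative, not the actual Build_Hash.py code):

```python
def build_id_index(*datasets):
    # Collect the distinct string IDs across all splits, then assign each a
    # stable integer. Sorting makes the mapping deterministic, so train,
    # test, and validation all see the same ID -> int key.
    ids = sorted({i for ds in datasets for i in ds})
    return {s: n for n, s in enumerate(ids)}
```

In the real pipeline this mapping would be built once, persisted (here, to HDFS), and joined against each split before fitting ALS.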
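The grid search amounts to evaluating the ALS model at every combination of hyperparameter values (for implicit-feedback ALS, typically rank, regParam, and alpha) and keeping the best-scoring one. A hedged plain-Python sketch of that loop, with a caller-supplied `evaluate` standing in for a Spark fit-and-score step (all names here are hypothetical):

```python
from itertools import product

def grid_search(evaluate, ranks, reg_params, alphas):
    # Exhaustively try each hyperparameter combination; `evaluate` is assumed
    # to fit a model and return a score where higher is better.
    best = None
    for rank, reg, alpha in product(ranks, reg_params, alphas):
        score = evaluate(rank, reg, alpha)
        if best is None or score > best[0]:
            best = (score, {"rank": rank, "regParam": reg, "alpha": alpha})
    return best
```

In GridSearch_All.py the equivalent of `evaluate` would fit a PySpark ALS model on the training split and score its top-500 recommendations against the validation split.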
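For the 0.5% subsample, a natural approach is to sample users rather than individual rows, so each sampled user keeps a complete listening history. A small illustration of that strategy in plain Python (the function and its signature are assumptions, not the actual Subsample.py code):

```python
import random

def subsample_users(interactions, fraction, seed=42):
    # interactions: list of (user, track, count) rows.
    # Sample `fraction` of the distinct users, then keep every row
    # belonging to a sampled user so their play counts stay intact.
    users = sorted({u for u, _, _ in interactions})
    rng = random.Random(seed)
    k = max(1, int(len(users) * fraction))
    keep = set(rng.sample(users, k))
    return [row for row in interactions if row[0] in keep]
```

Sampling by user avoids the bias of row-level sampling, which would leave most users with only a fragment of their history.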