Using Apache PySpark and Alternating Least Squares (ALS), we built a recommender system on the Million Song Dataset. Our paper can be found here.
The following files were run sequentially to obtain the final results from the ALS model (i.e., 500 recommendations per user):
- Build_Hash.py: creates a uniform integer hash key for the train, test, and validation sets, then saves the key to our local HDFS
- Parquet_Build.py: loads the uniform hash key from HDFS, applies it to each dataset, and writes the new Parquet files back out to our local HDFS
- GridSearch_All.py: performs a grid search over the ALS model's hyperparameters
- GridSearchFinal: folder containing our grid-search results and the corresponding Jupyter notebook
- FinalModel.py: runs our final model with the optimal hyperparameters (at a high number of iterations)
- Subsample.py: subsamples 0.5% of the train and test user/track/count data
- Lenskit_Extension.ipynb: extension results
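The uniform hash key exists because Spark's ALS requires integer user and item IDs, while the raw dataset uses string IDs; the same mapping must be applied to every split so an ID hashes identically everywhere. A minimal plain-Python sketch of that idea (the function name and data shapes are illustrative, not the actual Build_Hash.py code):

```python
def build_id_index(*datasets):
    # Collect the distinct string IDs across all splits, then assign each a
    # stable integer. Sorting makes the mapping deterministic, so train,
    # test, and validation all see the same ID -> int key.
    ids = sorted({i for ds in datasets for i in ds})
    return {s: n for n, s in enumerate(ids)}
```

In the real pipeline this mapping would be built once, persisted (here, to HDFS), and joined against each split before fitting ALS.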
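The grid search amounts to evaluating the ALS model at every combination of hyperparameter values (for implicit-feedback ALS, typically rank, regParam, and alpha) and keeping the best-scoring one. A hedged plain-Python sketch of that loop, with a caller-supplied `evaluate` standing in for a Spark fit-and-score step (all names here are hypothetical):

```python
from itertools import product

def grid_search(evaluate, ranks, reg_params, alphas):
    # Exhaustively try each hyperparameter combination; `evaluate` is assumed
    # to fit a model and return a score where higher is better.
    best = None
    for rank, reg, alpha in product(ranks, reg_params, alphas):
        score = evaluate(rank, reg, alpha)
        if best is None or score > best[0]:
            best = (score, {"rank": rank, "regParam": reg, "alpha": alpha})
    return best
```

In GridSearch_All.py the equivalent of `evaluate` would fit a PySpark ALS model on the training split and score its top-500 recommendations against the validation split.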
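For the 0.5% subsample, a natural approach is to sample users rather than individual rows, so each sampled user keeps a complete listening history. A small illustration of that strategy in plain Python (the function and its signature are assumptions, not the actual Subsample.py code):

```python
import random

def subsample_users(interactions, fraction, seed=42):
    # interactions: list of (user, track, count) rows.
    # Sample `fraction` of the distinct users, then keep every row
    # belonging to a sampled user so their play counts stay intact.
    users = sorted({u for u, _, _ in interactions})
    rng = random.Random(seed)
    k = max(1, int(len(users) * fraction))
    keep = set(rng.sample(users, k))
    return [row for row in interactions if row[0] in keep]
```

Sampling by user avoids the bias of row-level sampling, which would leave most users with only a fragment of their history.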