Apache Spark

The aim was to develop a console-based application in Python and Apache Spark (PySpark) to analyse the MovieLens dataset, which contains 27,000,000 ratings and 1,100,000 tag applications across 58,000 movies by 280,000 users.

The solution supports extensive querying of the dataset. Its more advanced features are user clustering and a recommender engine. On start-up, the console application presents a self-explanatory menu. A Flask web app is also included and can be used to visualise some of the data.

Compiling and Running Instructions

Navigate into the spark directory:

cd spark

Add the MovieLens ml-latest and ml-latest-small datasets to the directory.
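Before continuing, it can help to confirm that the datasets are where the application will look for them. The following sketch is not part of the repository. It assumes the archives were extracted into folders named ml-latest and ml-latest-small inside the spark directory, and it checks only the four CSV files that both MovieLens archives ship. The helper name `missing_dataset_files` is hypothetical.

```python
from pathlib import Path

# Files present in both MovieLens archives (ml-latest also ships genome-*.csv).
EXPECTED_FILES = ("ratings.csv", "movies.csv", "tags.csv", "links.csv")
# Assumed folder names after extracting the archives into the spark directory.
DATASET_DIRS = ("ml-latest", "ml-latest-small")


def missing_dataset_files(base_dir="."):
    """Return the relative paths of expected dataset files absent under base_dir."""
    base = Path(base_dir)
    return [
        str(Path(folder) / name)
        for folder in DATASET_DIRS
        for name in EXPECTED_FILES
        if not (base / folder / name).is_file()
    ]
```

Running `print(missing_dataset_files())` from the spark directory prints an empty list when every expected file is present. Otherwise it lists the paths still to be added.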

Set up a virtual environment within the directory:

python -m venv my_env

Activate the virtual environment:

source my_env/bin/activate

Install the requirements into your virtual environment via pip:

pip install -r requirements.txt

Set the FLASK_APP environment variable (needed only for the web app):

export FLASK_APP=flask_app.py

To run the CLI version:

python main_menu.py

To run the web app:

flask run

You will be presented with a URL to open the web app in your browser.

Shared with the kind permission of collaborator ST.
