This project is part of the Big Data course at Kristianstad University. Its goal is to build a data pipeline for analyzing the May 2015 Reddit Comments dataset available on Kaggle. The pipeline is built with Apache Spark and Hadoop, stores the data in HDFS as Parquet files, and runs in a Docker cluster.
- Docker
- Docker Compose
- Python 3.11.6
- Pipenv
- Clone the repository.
- Follow the instructions in this repository to set up a Hadoop/Spark cluster correctly.
- Navigate to the project root.
- Enter the jupyter-notebook directory and run the following command to start the Jupyter Notebook server:
docker-compose up -d

The Jupyter Notebook server is now running on port 8888, without token authentication.
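
To confirm that the server actually came up, a quick reachability check can help. The following is a minimal sketch using only the Python standard library, not part of the project itself. The helper name `is_port_open` is hypothetical, and the host `localhost` is an assumption: it presumes the container's port 8888 is published on the machine where the script runs.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8888, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Hypothetical helper; the defaults assume the Jupyter container publishes
    port 8888 on the local machine.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_port_open():
        print("A service is listening on localhost:8888")
    else:
        print("Nothing is listening on localhost:8888 yet")
```

The check only shows that something accepts TCP connections on the port. It does not verify that the service is Jupyter, so opening the address in a browser remains the definitive test.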