defaultdino/big-data-project

Big Data Project

Project Description

This project is part of the Big Data course at Kristianstad University. Its goal is to build a data pipeline for analyzing the May 2015 Reddit Comments Dataset available on Kaggle. The pipeline is built with Apache Spark and Hadoop, stores its data in HDFS as Parquet files, and runs in a Docker cluster.
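As a rough illustration of the storage step described above, the following sketch converts raw comments into Parquet files on HDFS with PySpark. It is not taken from the project's notebooks: the NameNode host `namenode`, the port `9000`, the JSON input format, and both paths are assumptions chosen for the example, and `hdfs_uri` and `convert_to_parquet` are hypothetical helper names.

```python
from __future__ import annotations


def hdfs_uri(path: str, namenode: str = "namenode", port: int = 9000) -> str:
    """Build an hdfs:// URI. The default host and port are assumptions."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"


def convert_to_parquet(source_uri: str, target_uri: str) -> None:
    """Read raw comments (assumed to be JSON lines) and store them as Parquet."""
    # Imported lazily so that hdfs_uri() can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reddit-comments-to-parquet").getOrCreate()
    try:
        comments = spark.read.json(source_uri)
        # Overwrite any earlier output so the conversion can be re-run safely.
        comments.write.mode("overwrite").parquet(target_uri)
    finally:
        spark.stop()
```

Inside the cluster, a call such as `convert_to_parquet(hdfs_uri("raw/reddit_comments.json"), hdfs_uri("parquet/reddit_comments"))` would perform the conversion, provided those hypothetical paths match your HDFS layout.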

Pre-requisites

  • Docker
  • Docker Compose
  • Python 3.11.6
  • Pipenv

How to run

  1. Clone the repository

  2. Follow the instructions in this repository to set up a Hadoop/Spark cluster.

  3. Navigate to the project root

  4. Enter the jupyter-notebook directory and run the following command to start the Jupyter Notebook server:

docker-compose up -d

The Jupyter Notebook server is now running on port 8888 with token authentication disabled.
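To confirm that the server came up, a small check like the one below can help. It is only a sketch: it assumes the container publishes port 8888 on `localhost`, and `is_port_open` is a hypothetical helper, not part of the project.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8888, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Jupyter Notebook is reachable on port 8888")
    else:
        print("Nothing is listening on port 8888 yet")
```

A successful check only shows that something is listening on the port; opening the address in a browser confirms that it is the notebook server.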

Authors

About

Big data project for course in Big Data Analytics at HKR
