Big Data Analytics Environment with Hadoop, Hive, Pig, Hue, and Jupyter

Overview

This project provides a complete big data development environment using Docker Compose. It includes:

  • Hadoop (HDFS & YARN) for distributed storage and processing
  • Hive for SQL-like querying on large datasets
  • Pig for data flow scripts and transformations
  • Hue as a web-based GUI to interact with Hadoop, Hive, and Pig
  • Jupyter Notebook for running data analysis and machine learning pipelines in Python

It is ideal for data scientists, engineers, and students looking to explore big data technologies in a local and containerized setup.


Components

Hadoop (HDFS & YARN)

Hadoop is the foundation of the ecosystem, providing distributed storage and processing capabilities. HDFS (Hadoop Distributed File System) enables scalable, fault-tolerant data storage across nodes, while YARN handles resource management and job scheduling for the cluster.
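
As a quick illustration, the sketch below lists and writes HDFS files from Python over WebHDFS. It assumes the NameNode's WebHDFS endpoint is reachable on localhost:9870 (the same port as the HDFS UI listed under Access Points), that the hdfs Python package is installed, and that the user name and file paths are placeholders to adapt to your compose file.

```python
# Minimal sketch: browsing and writing HDFS over WebHDFS from Python.
# Assumes the NameNode exposes WebHDFS on localhost:9870 and that the
# `hdfs` package (pip install hdfs) is available. The user name and
# the local file "sales.csv" are placeholders, not part of this repo.
from hdfs import InsecureClient

client = InsecureClient("http://localhost:9870", user="hadoop")

# List the root directory, create a working directory, and upload a CSV
# that can later be queried from Hive or Pig.
print(client.list("/"))
client.makedirs("/data")
client.upload("/data/sales.csv", "sales.csv", overwrite=True)
```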

Hive

Hive enables querying and managing large datasets residing in HDFS using a SQL-like language (HiveQL). Its Metastore, backed by PostgreSQL in this setup, stores table schemas and other metadata.
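
For example, a minimal PyHive session against HiveServer2 might look like the sketch below. It assumes HiveServer2 is exposed on localhost:10000 (see Access Points) with no authentication configured; the sales table and the user name are placeholders.

```python
# Minimal sketch: running HiveQL from Python with PyHive
# (pip install 'pyhive[hive]'). Host, port, user, and the "sales"
# table are assumptions to adapt to your compose setup.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hive")
cursor = conn.cursor()

# Inspect what the Metastore knows about.
cursor.execute("SHOW TABLES")
print(cursor.fetchall())

# Run an aggregate query and iterate over the result set.
cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")
for category, cnt in cursor.fetchall():
    print(category, cnt)
```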

Pig

Pig is a high-level platform for creating MapReduce programs using a scripting language called Pig Latin. It’s particularly useful for data transformations and ETL tasks.

Hue

Hue is an open-source analytics workbench for querying and visualizing data. It integrates with Hive and Pig, providing a user-friendly GUI for writing queries and managing data.

Jupyter Notebook

Jupyter provides an interactive Python environment where users can access Hadoop and Hive using libraries like PyHive and interact with data using familiar tools such as pandas and matplotlib.
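
A typical notebook cell might look like the following sketch: pull a Hive result set into pandas and plot it with matplotlib. The hive-server hostname is an assumed Docker Compose service name, and the query, table, and column names are placeholders.

```python
# Sketch of a notebook cell: Hive result set -> pandas DataFrame -> chart.
# "hive-server" is an assumed compose service name; from outside the
# Docker network you would use localhost:10000 instead.
import pandas as pd
import matplotlib.pyplot as plt
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="hive")

df = pd.read_sql(
    "SELECT category, SUM(amount) AS total FROM sales GROUP BY category",
    conn,
)
# Normalize column names in case Hive prefixes them with the table name.
df.columns = ["category", "total"]

df.plot(kind="bar", x="category", y="total")
plt.tight_layout()
plt.show()
```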


Features

  • Full local simulation of a big data ecosystem
  • Interactive Jupyter notebooks for development and experimentation
  • Web-based access to Hive, HDFS, and Pig via Hue
  • Persistent volumes to retain data between container restarts
  • Easily extensible to include Spark, Airflow, Superset, or other components

Use Cases

  • Learning and practicing big data tools
  • Prototyping data pipelines in a simulated Hadoop cluster
  • Building and testing data science models that rely on distributed data
  • Teaching Hadoop ecosystem components in classroom or workshop settings

Getting Started

  1. Start the containers using Docker Compose.
  2. Access HDFS via the web UI to upload or view files.
  3. Use Hue to write and run Hive or Pig queries.
  4. Launch Jupyter to run Python-based analytics.
  5. Store all notebooks in the notebooks folder.

Access Points

  • Hue GUI: Available at http://localhost:8888
  • Jupyter Notebook: Available at http://localhost:8889
  • HDFS UI: Available at http://localhost:9870
  • HiveServer2 (JDBC): Port 10000 (for external connections)

Requirements

  • Docker
  • Docker Compose
  • At least 6 GB of RAM available for Docker

Next Steps

  • Add Spark integration for distributed computing
  • Connect Superset for data visualization
  • Incorporate Airflow for workflow orchestration
  • Enable secure access via HTTPS and authentication layers
