BigData

Learn Big Data through the Spark Python API (PySpark) by running the Jupyter notebooks, which show how to read, process, and write data.
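
As a taste of what the notebooks cover, here is a minimal PySpark sketch of that read/process/write cycle; the file paths and column names are placeholders, not data shipped with the repository:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local session for illustration; the notebooks run against the Docker cluster.
spark = SparkSession.builder.appName("read-process-write").getOrCreate()

# Read: load a CSV file into a DataFrame (path and columns are placeholders).
df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)

# Process: keep positive amounts and aggregate per country.
summary = (
    df.filter(F.col("amount") > 0)
      .groupBy("country")
      .agg(F.sum("amount").alias("total_amount"))
)

# Write: persist the result as Parquet.
summary.write.mode("overwrite").parquet("output/sales_by_country")

spark.stop()
```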

Contents

  • Quick Start
  • Tech
  • References

Quick Start

Cluster overview

Application       URL
Hadoop            localhost:9870
MapReduce         localhost:8089
HUE               localhost:8088
Mongo Cluster     localhost:27017
Kafka Manager     localhost:9000
JupyterLab        localhost:8888
Spark Master      localhost:8080

Prerequisites

  • Docker and Docker Compose installed (the cluster is built and run with docker-compose)

Build from Docker Hub

  1. Download the source code or clone the repository.
  2. Build the cluster (a quick service check is sketched after this list):
     docker-compose up -d
     ./config.sh
  3. Remove the cluster by typing:
     docker-compose down
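
Once the containers are up, the web UIs listed under Cluster overview should respond. The following is only a sketch of a reachability check that can be run from a JupyterLab notebook, assuming the localhost ports from the table above; the service list here is illustrative and not taken from the repository's scripts:

```python
# Quick reachability check for the cluster's web UIs.
# Ports are the ones listed in "Cluster overview"; adjust if your
# docker-compose port mappings differ.
from urllib.request import urlopen

SERVICES = {
    "Hadoop": 9870,
    "HUE": 8088,
    "Kafka Manager": 9000,
    "JupyterLab": 8888,
    "Spark Master": 8080,
}

for name, port in SERVICES.items():
    try:
        with urlopen(f"http://localhost:{port}", timeout=5) as resp:
            print(f"{name}: OK (HTTP {resp.status})")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")
```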

Tech

Hadoop

Apache Spark Standalone Cluster
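
The notebooks can attach to the standalone cluster instead of running Spark locally. A minimal sketch, assuming the master's compose service is named spark-master and listens on the default standalone port 7077 (the localhost:8080 entry in the table above is only the master's web UI); check docker-compose.yml for the actual hostname:

```python
from pyspark.sql import SparkSession

# spark://spark-master:7077 is an assumed service name plus the default
# standalone-master port; the web UI on localhost:8080 is not the master URL.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("standalone-example")
    .getOrCreate()
)

print(spark.sparkContext.master)  # confirm which master the session is bound to
spark.stop()
```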

Mongo Sharded Cluster

WARNING (Windows & OS X)

The default Docker setup on Windows and OS X uses a VirtualBox VM to host the Docker daemon. Unfortunately, the mechanism VirtualBox uses to share folders between the host system and the Docker container is not compatible with the memory mapped files used by MongoDB (see vbox bug, docs.mongodb.org and related jira.mongodb.org bug). This means that it is not possible to run a MongoDB container with the data directory mapped to the host.

– Docker Hub (source here or here)

Mongo Components
  • Config Server (3-member replica set): configsvr01, configsvr02, configsvr03
  • 3 Shards (each a 3-member PSS, i.e. Primary-Secondary-Secondary, replica set):
    • shard01-a, shard01-b, shard01-c
    • shard02-a, shard02-b, shard02-c
    • shard03-a, shard03-b, shard03-c
  • 2 Routers (mongos): router01, router02
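
A minimal PyMongo sketch of talking to the sharded cluster through a mongos router. It assumes the routers are published on localhost:27017, as in the table above; the database and collection names are made up for illustration:

```python
from pymongo import MongoClient

# Connect through a mongos router; 27017 matches the "Mongo Cluster" entry
# in the table above.
client = MongoClient("mongodb://localhost:27017/")

# Hypothetical database and collection, purely for illustration.
db = client["demo"]
db["events"].insert_one({"user": "alice", "action": "login"})

# Ask the router which shards back the cluster (admin command on mongos).
print(client.admin.command("listShards"))
```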

References