This is intended for learning purposes only and is not recommended for production. For production, use Hortonworks or Cloudera instead, or set it up with a cloud service: GCP has Dataproc and AWS has EMR (Elastic MapReduce).
- This Dockerfile builds from an existing Hadoop image, but if you would rather download the binary first and build your own image: https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
- Download the Apache Spark binary from https://downloads.apache.org/spark/spark-3.3.6/
- Clone this repo to your project directory, then:

```
cd hadoop-docker
```
- On Linux/macOS, simply run:

```
./run_cluster.sh
```
- On Windows, run:

```
docker build -t hadoop-base:3.3.6 . && docker-compose up
```
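Once the containers are up, you can sanity-check the cluster from another terminal. This assumes your compose file maps the default Hadoop 3 NameNode web UI port (9870); adjust if your setup differs.

```
# list the running cluster containers
docker ps

# NameNode web UI (default Hadoop 3 port, if mapped by docker-compose):
# http://localhost:9870
```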
- There is a `ratings_breakdown.py` Python file in the `map_reduce` directory; you can run it either locally in plain Python mode or on Hadoop (a sketch of the job is included after the local-run command below).
- To run it locally in Python mode:
```
python3 map_reduce/ratings_breakdown.py input/u.data
```
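The `-r hadoop` flag used in the next command is mrjob's runner option, so the script is presumably an mrjob job. As a hedged sketch only (the actual contents of `map_reduce/ratings_breakdown.py` may differ), a ratings breakdown over the MovieLens `u.data` format typically looks like this:

```python
# Hypothetical sketch of an mrjob ratings-breakdown job, not the repo's exact code.
from mrjob.job import MRJob


class RatingsBreakdown(MRJob):
    def mapper(self, _, line):
        # u.data is tab-separated: user_id, movie_id, rating, timestamp
        user_id, movie_id, rating, timestamp = line.split("\t")
        yield rating, 1

    def reducer(self, rating, counts):
        # total number of times each rating value was given
        yield rating, sum(counts)


if __name__ == "__main__":
    RatingsBreakdown.run()
```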
- To run the same file on Hadoop:

```
python3 map_reduce/ratings_breakdown.py -r hadoop \
    --hadoop-streaming-jar /opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    input/u.data
```
- The path `/opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar` may differ on your machine or with a different Hadoop version; to locate it, run:

```
find / -name "hadoop-streaming-*.jar" 2>/dev/null
```
- Make sure the `input` directory exists in HDFS; if it does not, create it:

```
hadoop fs -mkdir -p input
```
- Put the data from the `input` directory of your local project into HDFS:

```
hdfs dfs -put ./input/* input
```
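You can verify the upload with a listing; relative paths like `input` resolve under your HDFS home directory (e.g. `/user/<username>/input`):

```
hdfs dfs -ls input
```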
- Then you are ready to run:

```
hadoop jar /opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    -file /hadoop-data/map_reduce/word_count/mapper.py -mapper "python3 mapper.py" \
    -file /hadoop-data/map_reduce/word_count/reducer.py -reducer "python3 reducer.py" \
    -input input/words.txt \
    -output output_word_count
```
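For reference, here are hedged sketches of what streaming word-count scripts conventionally look like; the actual files in `map_reduce/word_count/` may differ:

```python
#!/usr/bin/env python3
# mapper.py (sketch): emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py (sketch): Hadoop Streaming sorts mapper output by key,
# so identical words arrive consecutively and can be summed in one pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

When the job finishes, inspect the result with:

```
hadoop fs -cat output_word_count/part-*
```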
- `core-site.xml` defaults and descriptions: https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-common/core-default.xml
- `hdfs-site.xml` defaults and descriptions: https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
- `mapred-site.xml` defaults and descriptions: https://hadoop.apache.org/docs/r3.3.6/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
- `yarn-site.xml` defaults and descriptions: https://hadoop.apache.org/docs/r3.3.6/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
- How to calculate the YARN and MapReduce memory configuration is described here; an illustrative snippet follows below.
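As an illustration only (the numbers assume a hypothetical node with 8 GB of RAM reserved for YARN; derive your own values from the guide), the resulting settings land in `yarn-site.xml` and `mapred-site.xml`:

```xml
<!-- yarn-site.xml (sketch): illustrative values for a hypothetical 8 GB node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>

<!-- mapred-site.xml (sketch): per-task containers, JVM heap ~80% of container -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
```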
Please start everything from `run_cluster.sh`, because the base image is created in that step.
- Add the Hive configuration files:
  - `hive-site.xml`
  - `beeline-log4j2.properties`
  - `hive-exec-log4j2.properties`
  - `hive-log4j2.properties`
  - `llap-daemon-log4j2.properties`
  - `ivysettings.xml`
  - `hive-env.sh`
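As a hedged starting point for `hive-site.xml` (the property names are standard Hive settings, but the values here are assumptions that depend on how your metastore is deployed):

```xml
<!-- hive-site.xml (sketch): minimal metastore settings; adjust host and paths for your cluster -->
<configuration>
  <property>
    <!-- HDFS location of managed tables (this is Hive's default) -->
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <!-- assumed metastore host; 9083 is the default Thrift metastore port -->
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```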