This is intended for learning purposes only and is not recommended for production. For production, use Hortonworks or Cloudera instead, or set it up with a cloud service: GCP has Dataproc and AWS has EMR (Elastic MapReduce).
- This Dockerfile builds from an existing Hadoop image, but if you would rather download the binary first and build your own image: https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
- Download the Apache Spark binary from https://downloads.apache.org/spark/spark-3.3.6/
- Clone this repo to your project directory, then:

```
cd hadoop-docker
```
- On Linux/macOS, simply run:

```
./run_cluster.sh
```
- On Windows, run:

```
docker build -t hadoop-base:3.3.6 . && docker-compose up
```
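Once the containers are up, you can sanity-check the cluster from another terminal. This assumes your compose file maps the default Hadoop 3 NameNode web UI port (9870); adjust if your setup differs.

```
# list the running cluster containers
docker ps

# NameNode web UI (default Hadoop 3 port, if mapped by docker-compose):
# http://localhost:9870
```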
- There is a `ratings_breakdown.py` Python file in the `map_reduce` directory; you can run it either locally in plain Python mode or on Hadoop (a sketch of the job is included after the local-run command below).
- To run it locally in Python mode:
```
python3 map_reduce/ratings_breakdown.py input/u.data
```
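The `-r hadoop` flag used in the next command is mrjob's runner option, so the script is presumably an mrjob job. As a hedged sketch only (the actual contents of `map_reduce/ratings_breakdown.py` may differ), a ratings breakdown over the MovieLens `u.data` format typically looks like this:

```python
# Hypothetical sketch of an mrjob ratings-breakdown job, not the repo's exact code.
from mrjob.job import MRJob


class RatingsBreakdown(MRJob):
    def mapper(self, _, line):
        # u.data is tab-separated: user_id, movie_id, rating, timestamp
        user_id, movie_id, rating, timestamp = line.split("\t")
        yield rating, 1

    def reducer(self, rating, counts):
        # total number of times each rating value was given
        yield rating, sum(counts)


if __name__ == "__main__":
    RatingsBreakdown.run()
```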
- To run the same file on Hadoop:

```
python3 map_reduce/ratings_breakdown.py -r hadoop \
    --hadoop-streaming-jar /opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    input/u.data
```
- The path `/opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar` may differ on your machine or with a different Hadoop version; to locate it, run:

```
find / -name "hadoop-streaming-*.jar" 2>/dev/null
```
- Make sure the `input` directory exists in HDFS; if it does not, create it:

```
hadoop fs -mkdir -p input
```
- Put the data from the `input` directory of your local project into HDFS:

```
hdfs dfs -put ./input/* input
```
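You can verify the upload with a listing; relative paths like `input` resolve under your HDFS home directory (e.g. `/user/<username>/input`):

```
hdfs dfs -ls input
```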
- Then you are ready to run:

```
hadoop jar /opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    -file /hadoop-data/map_reduce/word_count/mapper.py -mapper "python3 mapper.py" \
    -file /hadoop-data/map_reduce/word_count/reducer.py -reducer "python3 reducer.py" \
    -input input/words.txt \
    -output output_word_count
```
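For reference, here are hedged sketches of what streaming word-count scripts conventionally look like; the actual files in `map_reduce/word_count/` may differ:

```python
#!/usr/bin/env python3
# mapper.py (sketch): emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py (sketch): Hadoop Streaming sorts mapper output by key,
# so identical words arrive consecutively and can be summed in one pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

When the job finishes, inspect the result with:

```
hadoop fs -cat output_word_count/part-*
```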
- `core-site.xml` defaults and descriptions: https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-common/core-default.xml
- `hdfs-site.xml` defaults and descriptions: https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
- `mapred-site.xml` defaults and descriptions: https://hadoop.apache.org/docs/r3.3.6/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
- `yarn-site.xml` defaults and descriptions: https://hadoop.apache.org/docs/r3.3.6/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
- How to calculate the YARN and MapReduce memory configuration is described here; an illustrative snippet follows below.
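As an illustration only (the numbers assume a hypothetical node with 8 GB of RAM reserved for YARN; derive your own values from the guide), the resulting settings land in `yarn-site.xml` and `mapred-site.xml`:

```xml
<!-- yarn-site.xml (sketch): illustrative values for a hypothetical 8 GB node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>

<!-- mapred-site.xml (sketch): per-task containers, JVM heap ~80% of container -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
```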
Please start everything from `run_cluster.sh`, because the base image is created in that step.
- Add the Hive configuration files:
  - `hive-site.xml`
  - `beeline-log4j2.properties`
  - `hive-exec-log4j2.properties`
  - `hive-log4j2.properties`
  - `llap-daemon-log4j2.properties`
  - `ivysettings.xml`
  - `hive-env.sh`
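As a hedged starting point for `hive-site.xml` (the property names are standard Hive settings, but the values here are assumptions that depend on how your metastore is deployed):

```xml
<!-- hive-site.xml (sketch): minimal metastore settings; adjust host and paths for your cluster -->
<configuration>
  <property>
    <!-- HDFS location of managed tables (this is Hive's default) -->
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <!-- assumed metastore host; 9083 is the default Thrift metastore port -->
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```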