<a href="https://colab.research.google.com/github/gevargas/bigdata-management/blob/master/Intro_Hadoop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Apache Hadoop with Colab

# Configuration 

* Hadoop

In [None]:
# download hadoop 3.3.0 (older releases are served from the Apache archive)
!wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

In [None]:
# uncompress
!tar -xzf hadoop-3.3.0.tar.gz
!ls

* `JAVA_HOME` path

In [None]:
# find the default Colab java path
java_home = !readlink -f /usr/bin/java | sed "s:bin/java::"   # returns a list of size 1
java_home = java_home[0]

# set JAVA_HOME
%env JAVA_HOME={java_home}
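
The `readlink | sed` step above resolves the `java` binary and strips the trailing `bin/java` to obtain the JVM home directory. The same string manipulation can be sketched in plain Python. The helper name `java_home_from_binary` and the example path are hypothetical, and the real path on a given Colab image may differ.

```python
def java_home_from_binary(java_binary: str) -> str:
    """Strip the trailing 'bin/java' from a resolved java binary path."""
    suffix = "bin/java"
    if not java_binary.endswith(suffix):
        raise ValueError(f"unexpected java binary path: {java_binary}")
    # keep everything before the suffix, including the trailing slash,
    # which matches what the sed substitution leaves behind
    return java_binary[: -len(suffix)]


# Hypothetical resolved path; on Colab it would come from
# os.path.realpath("/usr/bin/java")
example = "/usr/lib/jvm/java-11-openjdk-amd64/bin/java"
print(java_home_from_binary(example))
```

In a notebook, the result could then be assigned with `os.environ["JAVA_HOME"] = ...` instead of the `%env` magic.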

## Extras

In [None]:
!rm -r sample_data/           # remove default sample_data folder

# Running Hadoop

In [None]:
!hadoop-3.3.0/bin/hadoop --help

# Example 1: Counting regex matches (grep)

* Prepare the input files

In [None]:
# copy hadoop configuration xml files to use as input
!mkdir input1/
!cp hadoop-3.3.0/etc/hadoop/*.xml  input1/
!ls input1

* Count the number of times `allowed[.]*` appears in the input

In [None]:
# use one of the mapreduce examples
# input:    path containing the text files to use as input
# output:   path to store the match counts (must not already exist)
# grep_exp: regular expression used to find matches in the input files

!hadoop-3.3.0/bin/hadoop jar hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar \
        grep  \
        input1 \
        output1 \
        'allowed[.]*'
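
To make the job's result easier to interpret, here is a minimal in-memory sketch of what the `grep` example computes: every match of the regular expression is extracted and the distinct matches are counted. The function `count_matches` and the sample lines are hypothetical. The sketch assumes the job reports each distinct match with its count, most frequent first, and the alphabetical tie-break is a choice made here, not Hadoop's documented behavior.

```python
import re
from collections import Counter


def count_matches(lines, pattern):
    """Count every distinct regex match across the given lines."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        # findall returns the full match text because the pattern has no groups
        counts.update(regex.findall(line))
    # most frequent first; ties broken alphabetically (an assumption)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


sample = [
    "<name>security.allowed.users</name>",
    "<name>hosts.allowed.list</name>",
    "ACL of who is allowed to submit",
]
for match, count in count_matches(sample, r"allowed[.]*"):
    print(f"{count}\t{match}")
```

Note that `allowed[.]*` matches the word `allowed` followed by zero or more literal dots, so `allowed` and `allowed.` are counted as different matches.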


* See the results

In [None]:
!ls output1/

In [None]:
!cat output1/part-*

# Example 2: Wordcount using Python map & reduce functions

* Collect the dataset

In [None]:
# 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups
# see http://qwone.com/~jason/20Newsgroups/ for more info

!wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
!tar -xzf 20news-18828.tar.gz

* Mapper function

In [None]:
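%%writefile mapper.py

import sys
import io
import re
import nltk
from nltk.corpus import stopwords

# fetch the stop-word list on first use (no-op if already downloaded)
nltk.download('stopwords', quiet=True)

# raw string, so the backslash is a literal character and not an invalid escape
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

stop_words = set(stopwords.words('english'))

# read stdin as latin1 so that non-UTF-8 bytes in the posts do not raise
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin1')
for line in input_stream:
    line = line.strip()
    # drop every character that is neither a word character nor whitespace
    line = re.sub(r'[^\w\s]', '', line)
    line = line.lower()
    # underscores count as word characters and survive the regex above,
    # so replace any remaining punctuation with a space
    for x in punctuations:
        line = line.replace(x, " ")

    # emit one tab-separated (word, 1) pair per non-stop-word
    words = line.split()
    for word in words:
        if word not in stop_words:
            print('%s\t%s' % (word, 1))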
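
The mapper's cleaning steps can be checked without Hadoop or NLTK by expressing them as a pure function. This is a minimal sketch: `map_line` is a hypothetical helper, and the tiny `STOP_WORDS` set is a stand-in for NLTK's English stop-word list used in `mapper.py`.

```python
import re

# Hypothetical, tiny stop-word set; mapper.py uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "a", "of", "and"}


def map_line(line, stop_words=STOP_WORDS):
    """Yield (word, 1) pairs for one input line, mirroring the mapper's steps."""
    # remove everything that is neither a word character nor whitespace
    cleaned = re.sub(r"[^\w\s]", "", line.strip()).lower()
    # the underscore is a word character, so it is the only punctuation left
    cleaned = cleaned.replace("_", " ")
    for word in cleaned.split():
        if word not in stop_words:
            yield word, 1


print(list(map_line("The quick_brown fox, the FOX!")))
# → [('quick', 1), ('brown', 1), ('fox', 1), ('fox', 1)]
```

The mapper does no aggregation: `fox` is emitted twice, and summing is left to the reducer.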


* Reducer function

In [None]:
%%writefile reducer.py

import sys

current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    line = line.lower()

    # parse the input we got from mapper.py
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # the line had no tab or the count was not a number,
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
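
The reducer depends on its input being sorted by key. A compact way to see this is a sketch built on `itertools.groupby`, which, like the reducer's running counter, only merges adjacent equal keys. The helper `reduce_sorted` is hypothetical and assumes well-formed `word<TAB>count` lines.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum the counts of adjacent lines that share the same word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


unsorted_input = ["fox\t1", "brown\t1", "fox\t1"]

# Without sorting, the two 'fox' lines are not adjacent and are reported twice.
print(list(reduce_sorted(unsorted_input)))
# → [('fox', 1), ('brown', 1), ('fox', 1)]

# After sorting (the role Hadoop's shuffle plays), each word appears once.
print(list(reduce_sorted(sorted(unsorted_input))))
# → [('brown', 1), ('fox', 2)]
```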




* Make the map/reduce files executable for Hadoop

In [None]:
!chmod u+rwx mapper.py
!chmod u+rwx reducer.py

* Start the MapReduce job using the Python files

In [None]:
# -files is a generic option and must come before the streaming options;
# it replaces the deprecated -file option
!hadoop-3.3.0/bin/hadoop \
    jar hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
    -files mapper.py,reducer.py \
    -input 20news-18828/alt.atheism/49960 \
    -output output2 \
    -mapper 'python mapper.py' \
    -reducer 'python reducer.py'

* Verify the result

In [None]:
!ls output2

In [None]:
!cat output2/part-00000

# Example 3: mrjob (Python library)

* Install mrjob

In [None]:
!pip install mrjob

* Measure how long the map and reduce scripts take when run as a plain shell pipeline, without Hadoop

In [None]:
%%timeit -n 1 -r 3
!cat 20news-18828/alt.atheism/49960 | python mapper.py | sort | python reducer.py
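
`%%timeit -n 1 -r 3` executes the cell once per measurement and repeats the measurement three times. Outside a notebook, roughly the same measurement can be sketched with the standard library's `timeit.repeat`. The `word_count` function and the synthetic input are hypothetical stand-ins for the real pipeline.

```python
import timeit


def word_count(lines):
    """Count whitespace-separated, lower-cased words in memory."""
    counts = {}
    for line in lines:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


# synthetic input, only to have something to time
lines = ["a b a", "b c"] * 1000

# number=1 and repeat=3 mirror the -n 1 -r 3 flags of %%timeit
timings = timeit.repeat(lambda: word_count(lines), number=1, repeat=3)
print(f"best of {len(timings)}: {min(timings):.6f} s")
```

For a single small file, start-up cost (launching interpreters, or a Hadoop job) is likely to dominate these timings rather than the counting itself.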

* Count frequent words with mrjob

In [None]:
%%timeit -n 1 -r 3
# run the bundled example as a module, so the command does not depend on
# the Python version in the install path
!python -m mrjob.examples.mr_word_freq_count 20news-18828/alt.atheism/49960
