<a href="https://colab.research.google.com/github/d-vinha/SPBD/blob/main/lab1/SPBD_Labs_mapreduce1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python MapReduce Example

Word count implemented in pure Python.

This notebook demonstrates how to run a MapReduce program written in Python using Hadoop.
In this example, Hadoop runs in standalone mode and reads data from the local filesystem; in cluster mode, data is typically read from the HDFS distributed file system.


### Download the dataset

In [2]:
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## WordCount Example
Read the words from input and count them.

The processing is split into two steps:

+ The mapper emits, for each input line, the number of words in that line;
+ The reducer sums the counts in all the tuples produced by the mapper stage.
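
The two steps can be sketched in memory, without Hadoop or files. This is a simplified illustration only: the names `map_line` and `reduce_pairs` are hypothetical, and words are split on whitespace with no punctuation handling.

```python
def map_line(line):
    """Map step: emit a ('words', n) pair, where n is the number of
    whitespace-separated tokens in the line."""
    return ('words', len(line.split()))


def reduce_pairs(pairs):
    """Reduce step: sum the counts of all pairs (they share a single key)."""
    return ('words', sum(count for _key, count in pairs))


lines = ["the quick brown fox", "", "jumps over the lazy dog"]
pairs = [map_line(line) for line in lines]
print(pairs)                # [('words', 4), ('words', 0), ('words', 5)]
print(reduce_pairs(pairs))  # ('words', 9)
```

Because every pair uses the same key, a single reducer call produces the final total.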

### Mapper

Starting a cell with "%%file mapper.py" tells the notebook to write the cell's contents to the file mapper.py on the local disk when the cell is run, instead of executing them.

In [4]:
%%file mapper.py
#!/usr/bin/env python3
# str.maketrans (used below) requires Python 3

import string
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # remove punctuation characters
    line = line.translate(str.maketrans('', '', string.punctuation+'«»'))
    # split the line into words
    words = line.split()
    # emit a tab-separated key/value pair: constant key, word count of this line
    print('words\t%s' % len(words))

Writing mapper.py
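
To see what the punctuation-stripping step in mapper.py does to the count, the same translation table can be tried on a few sample strings. The helper name `count_words` and the sample strings are made up for illustration.

```python
import string

# Same table as in mapper.py: delete ASCII punctuation and the guillemets « ».
TABLE = str.maketrans('', '', string.punctuation + '«»')


def count_words(line):
    """Count the words in one line the same way mapper.py does."""
    return len(line.strip().translate(TABLE).split())


print(count_words("«Os Maias», de Eça de Queirós."))  # 6
print(count_words("guarda-chuva"))                    # 1 (hyphen removed, still one token)
print(count_words("sim - não"))                       # 2 (the lone hyphen disappears)
print(count_words("   "))                             # 0
```

Characters are deleted rather than replaced by spaces, so a hyphenated word stays a single token, while a punctuation mark standing on its own is not counted as a word.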


### Reducer

In [10]:
%%file reducer.py
#!/usr/bin/env python

import sys

total_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    key, count = line.split('\t', 1)

    # convert count (currently a string) to int
    count = int(count)

    total_count += count

print('words\t%s' % (total_count))

Overwriting reducer.py
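
The reducer logic can also be checked without a shell by feeding it an in-memory stream in place of standard input. This is a sketch, not the script itself: the function name `reduce_stream` is hypothetical, and skipping blank lines is an added assumption (reducer.py as written would raise an error on them).

```python
import io


def reduce_stream(stream):
    """Sum the counts of tab-separated 'key<TAB>count' lines."""
    total = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue  # assumption: ignore blank lines instead of failing
        _key, count = line.split('\t', 1)
        total += int(count)
    return total


# io.StringIO stands in for sys.stdin here.
fake_stdin = io.StringIO("words\t3\nwords\t0\nwords\t5\n")
print('words\t%s' % reduce_stream(fake_stdin))  # words	8
```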


## Local execution

The scripts can be tested using just the Unix shell, without Hadoop, as follows.

### Make the scripts executable

In [6]:
!chmod a+x mapper.py && chmod a+x reducer.py

### Execute

The execution workflow is as follows:

+ The input file is piped into the input of the mapper;
+ The output of the mapper is sorted by key;
+ The sorted output of the mapper is fed to the reducer stage.
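
The same workflow can be emulated inside Python to show why the sort step matters: once the pairs are sorted, all values for a given key are adjacent and can be reduced group by group. The sketch below is an in-process approximation, not how Hadoop is implemented, and all function names are hypothetical. The second mapper, which emits one pair per word, is included only to show grouping with more than one key.

```python
from itertools import groupby


def run_pipeline(lines, mapper, reducer):
    """Emulate: cat input | mapper | sort -k1,1 | reducer."""
    mapped = [pair for line in lines for pair in mapper(line)]  # map
    mapped.sort(key=lambda pair: pair[0])                       # sort by key
    for key, group in groupby(mapped, key=lambda pair: pair[0]):
        yield reducer(key, (value for _key, value in group))    # reduce


def count_mapper(line):
    # Single constant key, as in mapper.py.
    yield ('words', len(line.split()))


def per_word_mapper(line):
    # Illustrative variant: one (word, 1) pair per word.
    for word in line.split():
        yield (word, 1)


def sum_reducer(key, values):
    return (key, sum(values))


print(list(run_pipeline(["a b c", "d e"], count_mapper, sum_reducer)))
# [('words', 5)]
print(list(run_pipeline(["b a", "a"], per_word_mapper, sum_reducer)))
# [('a', 2), ('b', 1)]
```

With a single key, as in this notebook, sorting does not change the total; it becomes essential as soon as the mapper emits several distinct keys.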

In [11]:
!cat "os_maias.txt" | ./mapper.py | sort -k1,1 | ./reducer.py

words	213359


## MapReduce with Hadoop

In [12]:
#@title Install Hadoop on Google Colab
!curl -s https://raw.githubusercontent.com/smduarte/spbd-2324/main/lab1/install_hadoop.sh | bash

## Hadoop standalone mode execution

To execute in a Hadoop cluster, the input data should first be moved into an HDFS directory. In standalone mode, data can be read from the local filesystem.


The output directory must not exist when the job starts, so the results of any previous run need to be removed first.

In [14]:
!rm -rf results

### Submitting the job

The _hadoop_ command is used to submit the MapReduce job. The same command would submit it to a cluster; here, in standalone mode, the job runs locally.

In [15]:
!hadoop jar /usr/local/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input os_maias.txt -output results

2023-09-14 10:41:21,233 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2023-09-14 10:41:21,461 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-09-14 10:41:21,461 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2023-09-14 10:41:21,489 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2023-09-14 10:41:21,812 INFO mapred.FileInputFormat: Total input files to process : 1
2023-09-14 10:41:21,838 INFO mapreduce.JobSubmitter: number of splits:1
2023-09-14 10:41:22,277 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local738239175_0001
2023-09-14 10:41:22,277 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-14 10:41:22,665 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local738239175_0001_224ccbb8-fe54-4797-89c1-60e4215cf220/mapper.py
2023-09-14 10:41:22,691 INFO mapred.LocalDistributedCacheMana

#### Checking the results
The result is stored in the _results_ directory, in one or more files named part-*.

In [16]:
!cat results/part-*

words	213359
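
The output can also be loaded back into Python for further processing. The sketch below assumes each line of the part-* files is a tab-separated key and integer value, as produced by reducer.py; the function name `read_results` is hypothetical.

```python
import glob
import os


def read_results(output_dir):
    """Parse 'key<TAB>value' lines from every part-* file in output_dir."""
    totals = {}
    for path in sorted(glob.glob(os.path.join(output_dir, 'part-*'))):
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.rstrip('\n')
                if not line:
                    continue  # skip empty lines
                key, value = line.split('\t', 1)
                totals[key] = int(value)
    return totals


# Returns an empty dict if the directory does not exist or has no part files.
print(read_results('results'))
```

For the run shown above, this should give a dictionary with the single key 'words' mapped to the total printed by the previous cell.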
