## Map Reduce using Python and Hadoop

### Hadoop Installation


*   Download and unpack the Hadoop archive
*   Copy the Hadoop folder to /usr/local/
*   Set the Java directory as the `JAVA_HOME` environment variable
*   Check that Hadoop is installed properly (run `hadoop`, then `hdfs dfs -ls`)




In [1]:
! pip install mrjob --quiet

In [2]:
!wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz --quiet

In [5]:
!tar -xzvf hadoop-3.2.4.tar.gz &> /dev/null

In [6]:
# Copy the Hadoop folder to /usr/local/
!cp -r hadoop-3.2.4/ /usr/local/

In [7]:
! ls /usr/lib/jvm/

default-java  java-1.11.0-openjdk-amd64  java-11-openjdk-amd64


In [8]:
import os
os.environ["JAVA_HOME"] ="/usr/lib/jvm/java-11-openjdk-amd64/" 
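
Before starting Hadoop it can help to confirm that `JAVA_HOME` points at a usable Java installation. The following is a minimal sketch, not part of the original notebook; the helper name `java_home_looks_valid` is hypothetical, and the check (an executable `bin/java` under the directory) is an assumption about what "usable" means here.

```python
import os

def java_home_looks_valid(java_home):
    """Return True if java_home contains an executable bin/java (hypothetical helper)."""
    java_binary = os.path.join(java_home, "bin", "java")
    return os.path.isfile(java_binary) and os.access(java_binary, os.X_OK)

# Read the variable set in the cell above; fall back to an empty string if it is unset.
java_home = os.environ.get("JAVA_HOME", "")
print(java_home, java_home_looks_valid(java_home))
```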

In [10]:
# Run Hadoop to check that the binary works
!/usr/local/hadoop-3.2.4/bin/hadoop &> /dev/null

In [11]:
! hadoop-3.2.4/bin/hdfs dfs -ls
os.environ["hdfs"] = "hadoop-3.2.4/bin/hdfs"

Found 4 items
drwxr-xr-x   - root root       4096 2022-07-13 13:42 .config
drwxr-xr-x   - 1000 1000       4096 2022-07-12 12:42 hadoop-3.2.4
-rw-r--r--   1 root root  492368219 2022-07-22 02:06 hadoop-3.2.4.tar.gz
drwxr-xr-x   - root root       4096 2022-07-13 13:43 sample_data


### Download dataset
The MapReduce job will be run on the Hotel Reviews dataset, which contains two columns: the review, written as plain text, and the rating, on a scale from 1 to 5.

In [12]:
from google.colab import files
uploaded = files.upload()

Saving Hotel_Reviews.csv.zip to Hotel_Reviews.csv.zip


In [13]:
!unzip Hotel_Reviews.csv.zip

Archive:  Hotel_Reviews.csv.zip
  inflating: Hotel_Reviews.csv       


Let's see what the dataset looks like:

In [14]:
import pandas as pd
df = pd.read_csv("Hotel_Reviews.csv")
df.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [15]:
df.shape

(20491, 2)

### Map Reduce

MapReduce is a programming model for processing large amounts of data in parallel. The Map step turns each input record into a key-value pair (here, with a value of 1), and the Reduce step aggregates the values for each key into the final result.

For the Hotel Reviews dataset, the Map step emits each review's rating as the key with a value of 1. The Reduce step then sums these values per rating, giving the number of reviews for each rating.
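
The idea can be sketched in plain Python without any framework. This is an illustrative sketch only: the sample ratings and the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical, and a real MapReduce framework performs the grouping step itself and distributes the work across machines.

```python
from collections import defaultdict

# Hypothetical sample: ratings only, standing in for rows of the dataset.
sample_ratings = ["4", "2", "3", "5", "5"]

def map_phase(ratings):
    """Emit a (rating, 1) pair for every review."""
    return [(rating, 1) for rating in ratings]

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values of each key to get the number of reviews per rating."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle_phase(map_phase(sample_ratings)))
print(counts)  # {'4': 1, '2': 1, '3': 1, '5': 2}
```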

##### Map Reduce Job using Python MRJob (locally)

In [16]:
%%writefile CountRatings.py
#!/usr/local/bin/python

import csv

from mrjob.job import MRJob

# Column names of the CSV file, used to turn each row into key-value pairs
columns = 'Review,Rating'.split(',')

class RatingCount(MRJob):
  """MapReduce job that counts the number of reviews per rating."""

  # Map: emit (rating, 1) for every review
  def mapper(self, _, line):
    # Parse the raw input line as a single CSV row
    content = csv.reader([line])
    for row in content:
      dictionary = dict(zip(columns, row))
      rating = dictionary['Rating']
      # Skip the header row
      if rating != "Rating":
        yield (rating, 1)

  # Reduce: sum the ones emitted for each rating
  def reducer(self, key, counts):
    yield (key, sum(counts))

if __name__ == "__main__":
  RatingCount.run()


Writing CountRatings.py


In [17]:
! python CountRatings.py Hotel_Reviews.csv -o output

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/CountRatings.root.20220729.104909.712226
job output is in output
Removing temp directory /tmp/CountRatings.root.20220729.104909.712226...


The job writes its result to the output directory, split across several files. We can display their contents:

In [18]:
!ls output/

part-00000  part-00001	part-00002


In [19]:
!cat output/part-00000
!cat output/part-00001
!cat output/part-00002

"1"	1421
"2"	1793
"3"	2184
"4"	6039
"5"	9054


The output above shows that the job ran successfully: the five counts sum to 20,491, which matches the number of rows in the dataset.
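
As a sanity check, the output lines can be parsed and totalled. The sketch below assumes the output format seen above, where each line holds a JSON-encoded key and value separated by a tab; the lines are copied from the output rather than read from the `output/` directory, and the helper `parse_line` is a hypothetical name.

```python
import json

# Lines copied from the job output shown above (key and value separated by a tab).
output_lines = [
    '"1"\t1421',
    '"2"\t1793',
    '"3"\t2184',
    '"4"\t6039',
    '"5"\t9054',
]

def parse_line(line):
    """Split one output line into its decoded (rating, count) pair."""
    key, value = line.split("\t", 1)
    return json.loads(key), json.loads(value)

counts = dict(parse_line(line) for line in output_lines)
total = sum(counts.values())
print(total)  # 20491, the row count reported by df.shape
```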

##### Map Reduce using Hadoop 

To run the job on Hadoop, we can create a Dataproc cluster on Google Cloud.

The dataset and the job script first have to be copied to HDFS:

In [None]:
%%shell
hadoop fs -put Hotel_Reviews.csv
hadoop fs -put CountRatings.py

Either of the following commands can be used to run the job on Hadoop:

In [None]:
%%shell
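# Option 1: submit through Hadoop Streaming directly
# (as written, no -mapper or -reducer is passed, so these would need to be added for CountRatings.py to actually run)
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -files MapR/CountRatings.py -input MapR/Hotel_Reviews.csv -output new-dir

# Option 2: let mrjob submit the job through its Hadoop runner
python CountRatings.py -r hadoop --hadoop-streaming-jar /usr/lib/hadoop/hadoop-streaming.jar Hotel_Reviews.csv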