
# Assignment 2

In this assignment, you are asked to calculate the word co-occurrence of the Shakespeare collection and check its correctness using a non-Hadoop approach. Specifically, we ask you to implement the **pairs** design pattern.

We have prepared some test cases to help you verify the correctness of your code. Correctly implemented mapper and reducer functions should produce results as indicated in the test cells:
1. How many pairs of words co-occurred at least 10 times?
2. How many times do the words 'macbeth' and 'lady' co-occur?
3. Which pair of words has the largest co-occurrence count?

We have prepared the code to help you check your answer. However, we also ask you to **calculate the word co-occurrence without using Hadoop**. Could you answer the previous three questions with a non-Hadoop solution?

Some specific requirements are as follows:
1. Your mapper should output co-occurrence pairs as `(word1, word2)\t[NUM]`, because Hadoop uses `\t` to separate keys from values.
2. You should output the *unique* co-occurrences of different words for each line. That is, a line containing the four words "A A B C" should yield 6 co-occurrence pairs: (A,B), (B,A), (A,C), (C,A), (B,C), (C,B).
3. We are only interested in pairs of words that cooccur at least 10 times.
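
Requirement 2 can be illustrated with a minimal sketch. It assumes a hypothetical helper `line_pairs` (not part of the provided skeleton) that receives an already tokenized line, drops repeated words, and returns every ordered pair of distinct words once:

```python
from itertools import permutations

def line_pairs(tokens):
    """Return the unique ordered co-occurrence pairs for one tokenized line.

    Hypothetical helper for illustration; not part of the assignment skeleton.
    """
    # dict.fromkeys drops repeated words while keeping first-occurrence order
    unique = list(dict.fromkeys(tokens))
    # permutations of length 2 yields every ordered pair of distinct words
    return list(permutations(unique, 2))

pairs = line_pairs(["A", "A", "B", "C"])
print(len(pairs))
print(sorted(pairs))
```

Because the repeated "A" is removed before pairing, self-pairs such as (A,A) never appear and each ordered pair is counted at most once per line.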

The CoLab notebook is organized as follows:
1. In the first section **Installing Hadoop**, we provide you the basic code to install Hadoop. This is the same code we used during the classroom demo.
2. In the second section **Shakespeare Word Count**, we download the dataset and provide you the basic mapper and reducer for the word count example demonstrated in the classroom. This allows you to verify that Hadoop is installed correctly and runs a basic job.
3. In **Word Co-occurrence as Pairs**, we ask you to complete the mapper and reducer functions and run them with the Hadoop command.
4. In **Checking your answer**, we provide code to check your results.
5. Finally, in **Calculate Co-occurrence without Hadoop?**, we ask you to write your own code to calculate the cooccurrence and answer the same three questions.

**Grades:** This assignment accounts for 10 points. 
- 4 points for implementing the mapper and reducer correctly.
- 3 points for using the Hadoop output to answer the three questions correctly.
- 3 points for a separate implementation without using Hadoop.

**Submission:** Please download your notebook and submit it to ELMS. We will make a minimal effort to fix trivial issues with your code and deduct points, but will not spend time debugging your code. It is your responsibility to make sure your code runs in CoLab. If anything is unclear, it is your responsibility to seek clarification.

## Installing Hadoop

In [1]:
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

--2023-02-17 17:35:53--  https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 695457782 (663M) [application/x-gzip]
Saving to: ‘hadoop-3.3.4.tar.gz’


2023-02-17 17:36:15 (30.4 MB/s) - ‘hadoop-3.3.4.tar.gz’ saved [695457782/695457782]



In [2]:
!tar -xzf hadoop-3.3.4.tar.gz

In [3]:
!cp -r hadoop-3.3.4/ /usr/local/

In [4]:
!readlink -f /usr/bin/java | sed "s:bin/java::"

/usr/lib/jvm/java-11-openjdk-amd64/


In [5]:
#Importing os module
import os
#Creating environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"

In [6]:
!mkdir ~/input
!cp /usr/local/hadoop-3.3.4/etc/hadoop/*.xml ~/input

## Shakespeare Word Count

We made a couple of changes compared with Week 1:
1. We write a new tokenizer for the mapper, which removes unnecessary punctuation.
2. We filter out rare words (words that appear fewer than 10 times) in the reducer.
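
To see what the tokenizer keeps and drops, here is a small sketch that reuses the same regular expression as the mapper. The sample sentence is made up for illustration and is not taken from the dataset:

```python
import re

# Same pattern as the mapper: strip non-letters from both ends of a token
PAT = re.compile("(^[^a-z]+|[^a-z]+$)")

def tokenize(line):
    # lowercase each whitespace-separated token, then trim its edges
    tokens = [re.sub(PAT, "", t.lower()) for t in line.split()]
    # drop tokens that became empty (for example, pure numbers)
    return [t for t in tokens if t]

# Made-up sample line, used only to show the behavior
sample = 'MACBETH. "Is this a dagger--which I see before me, 1606?"'
tokens = tokenize(sample)
print(tokens)
```

Note that the pattern only trims the edges of each token, so punctuation inside a token (such as the double hyphen in `dagger--which`) is left in place, while a token made up entirely of non-letters disappears.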

In [3]:
# Download the Shakespeare collection
!wget https://www.gutenberg.org/cache/epub/100/pg100.txt

--2023-02-17 23:01:50--  https://www.gutenberg.org/cache/epub/100/pg100.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5732001 (5.5M) [text/plain]
Saving to: ‘pg100.txt’


2023-02-17 23:01:51 (10.7 MB/s) - ‘pg100.txt’ saved [5732001/5732001]



In [4]:
!mv pg100.txt /content/shakespeare.txt

In [9]:
mapperpy = '''#!/usr/bin/env python
"""mapper.py"""
import sys
import re
# input comes from STDIN (standard input)

PAT = re.compile("(^[^a-z]+|[^a-z]+$)")

def tokenize(line):
    #tokenize the string, remove unnecessary characters
    tokens = [re.sub(PAT, "", t.lower()) for t in line.split()]
    #returning non-empty strings as tokenization result
    return list(filter(lambda x:len(x)>0, tokens))

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # tokenize the line into words
    tokens = tokenize(line)
    # increase counters
    for t in tokens:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (t, 1))
'''

with open("mapper.py", "w") as fmap:
  fmap.write(mapperpy)

!chmod +x mapper.py

In [10]:
reducerpy = '''#!/usr/bin/env python
"""reducer.py"""
from operator import itemgetter
import sys
current_word = None
current_count = 0
word = None
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            if current_count >= 10:
                print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word
# do not forget to output the last word if needed!
if current_word == word:
    if current_count >= 10:
        print('%s\t%s' % (current_word, current_count))
'''

with open("reducer.py", "w") as freduce:
  freduce.write(reducerpy)

!chmod +x reducer.py

In [None]:
# -f avoids an error when the output directory does not exist yet
!rm -rf /content/output

!/usr/local/hadoop-3.3.4/bin/hadoop jar /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar \
  -input /content/shakespeare.txt \
  -output /content/output \
  -numReduceTasks 10 \
  -mapper "python /content/mapper.py" \
  -reducer "python /content/reducer.py"
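
The reducer above relies on its input arriving sorted by key. As a rough local sketch of that map, sort, reduce flow (it does not use Hadoop, it simplifies tokenization to a plain whitespace split, and `run_local_wordcount` and `min_count` are hypothetical names introduced here), the same logic can be tried on a few lines of text:

```python
from itertools import groupby

def run_local_wordcount(lines, min_count=1):
    """Simulate map -> sort -> reduce for word count on an in-memory list.

    Hypothetical helper for illustration only; Hadoop Streaming performs the
    sorting itself and spreads the work across several reducers.
    """
    # Map: emit (word, 1) for every whitespace-separated token
    mapped = [(word.lower(), 1) for line in lines for word in line.split()]
    # Shuffle/sort: bring identical keys next to each other
    mapped.sort(key=lambda kv: kv[0])
    # Reduce: sum the counts of each run of identical keys
    result = {}
    for word, group in groupby(mapped, key=lambda kv: kv[0]):
        total = sum(count for _, count in group)
        if total >= min_count:
            result[word] = total
    return result

counts = run_local_wordcount(
    ["to be or not to be", "to thine own self be true"], min_count=2
)
print(counts)
```

`groupby` only merges adjacent equal keys, which mirrors why the streaming reducer can compare each key with the previous one: without the sort step, the same word would be reported in several separate runs.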

## Word Co-occurrence as Pairs
Below we provide a skeleton mapper and reducer for you. Please complete the mapper and reducer functions to calculate the word co-occurrence as pairs. 

In [79]:
pairs_mapper = '''#!/usr/bin/env python
"""mapper.py"""
import sys
import re
# input comes from STDIN (standard input)

PAT = re.compile("(^[^a-z]+|[^a-z]+$)")

def tokenize(line):
    #tokenize the string, remove unnecessary characters
    tokens = [re.sub(PAT, "", t.lower()) for t in line.split()]
    #returning non-empty strings as tokenization result
    return list(filter(lambda x:len(x)>0, tokens))

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # tokenize the line into words
    tokens = tokenize(line)
    # collect each distinct word once, preserving first-occurrence order
    unique = list(dict.fromkeys(tokens))
    for t in unique:
        for v in unique:
            # skip self-pairs such as (A, A)
            if v == t:
                continue
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for the pairs reducer
            #
            # tab-delimited; each pair counts once per line
            print('(%s, %s)\t%s' % (t, v, 1))

'''

with open("pairs_mapper.py", "w") as fmap:
  fmap.write(pairs_mapper)

!chmod +x pairs_mapper.py

In [84]:
pairs_reducer = '''#!/usr/bin/env python
"""reducer.py"""
from operator import itemgetter
import sys

current_pair = None
current_count = 0
pair = None
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    pair, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: pair) before it is passed to the reducer
    if current_pair == pair:
        current_count += count
    else:
        if current_pair:
            # write result to STDOUT
            if current_count >= 10:
                print('%s\t%s' % (current_pair, current_count))
        current_count = count
        current_pair = pair
# do not forget to output the last pair if needed!
if current_pair == pair:
    if current_count >= 10:
        print('%s\t%s' % (current_pair, current_count))

'''

with open("pairs_reducer.py", "w") as freduce:
  freduce.write(pairs_reducer)

!chmod +x pairs_reducer.py

In [None]:
# -f avoids an error when the output directory does not exist yet
!rm -rf /content/pairs_output

!/usr/local/hadoop-3.3.4/bin/hadoop jar /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar \
  -input /content/shakespeare.txt \
  -output /content/pairs_output \
  -numReduceTasks 10 \
  -mapper "python /content/pairs_mapper.py" \
  -reducer "python /content/pairs_reducer.py"

## Checking your answer

In [86]:
# The following code checks how many co-occurrence pairs there are, that is, 
# the total number of lines in the output files. 
# The correct answer should be 78498
!cat /content/pairs_output/* | wc

  78498  235494 1248414


In [87]:
# The following code checks how many times the word 'macbeth' co-occurred with
# the word 'lady'.
# The correct answer should be 73
!grep "(macbeth, lady)" /content/pairs_output/*

/content/pairs_output/part-00006:(macbeth, lady)	73


In [88]:
# The following code checks which pair of words co-occurred the most times.
# The correct answer should be 'the' and 'of'.
!cat /content/pairs_output/* | sort -k 3 -g -r |head -2

(the, of)	7914
(of, the)	7914
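
The same three checks can also be done in Python once the reducer output is parsed. The sketch below assumes a hypothetical helper `parse_pairs_output` and runs on three lines copied from the outputs shown above rather than on the real files:

```python
def parse_pairs_output(lines):
    """Parse reducer output lines of the form '(w1, w2)<TAB>count' into a dict.

    Hypothetical helper for illustration; not part of the provided code.
    """
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # split on the last tab so the pair key stays intact
        pair, count = line.rsplit("\t", 1)
        counts[pair] = int(count)
    return counts

# Three lines copied from the outputs shown above, used as a tiny sample
sample = ["(macbeth, lady)\t73", "(the, of)\t7914", "(of, the)\t7914"]
counts = parse_pairs_output(sample)

print(len(counts))                 # number of pairs in the sample
print(counts["(macbeth, lady)"])   # count for one specific pair
print(max(counts.values()))        # largest co-occurrence count
```

To apply it to the real job output, the lines of every `part-*` file under `/content/pairs_output` would need to be read and passed in together, since the results are split across the reducers.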


## Calculate Co-occurrence without Hadoop?

Below, please redo the co-occurrence calculation without using Hadoop. You may use any method you are familiar with. Do the results match the Hadoop output?

In [17]:
file = open('/content/shakespeare.txt', 'r')


In [18]:
import re

PAT = re.compile("(^[^a-z]+|[^a-z]+$)")

def tokenize(line):
    #tokenize the string, remove unnecessary characters
    tokens = [re.sub(PAT, "", t.lower()) for t in line.split()]
    #returning non-empty strings as tokenization result
    return list(filter(lambda x:len(x)>0, tokens))

mapper_output = []

for line in file:
    # remove leading and trailing whitespace
    line = line.strip()
    # tokenize the line into words
    tokens = tokenize(line)
    # collect each distinct word once, preserving first-occurrence order
    unique = list(dict.fromkeys(tokens))
    for t in unique:
        for v in unique:
            # skip self-pairs such as (A, A)
            if v == t:
                continue
            mapper_output.append('(' + t + ', ' + v + ')')

In [19]:
len(mapper_output)

6672860

In [49]:
reducer_ram = {}
reducer_output = {}

# count how often each pair was emitted by the mapper step
for u in mapper_output:
    reducer_ram[u] = reducer_ram.get(u, 0) + 1

# keep only pairs that co-occur at least 10 times
for u in reducer_ram:
    if reducer_ram[u] >= 10:
        reducer_output[u] = reducer_ram[u]

In [50]:
len(reducer_output)

78498

In [51]:
reducer_output["(macbeth, lady)"]

73

In [52]:
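from operator import itemgetter
# take the two most frequent pairs (named top_pairs to avoid shadowing the built-in max)
top_pairs = dict(sorted(reducer_output.items(), key=itemgetter(1), reverse=True)[:2])
top_pairs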

{'(the, of)': 7914, '(of, the)': 7914}
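
A more compact variant of the non-Hadoop calculation is sketched below, using `collections.Counter` and `itertools.permutations` from the standard library. The function name `cooccurrence_counts` and its `min_count` parameter are hypothetical, and the keys are tuples rather than the `(w1, w2)` strings used above. It is shown on two made-up lines instead of the full collection:

```python
import re
from collections import Counter
from itertools import permutations

# Same tokenizer pattern as in the mapper
PAT = re.compile("(^[^a-z]+|[^a-z]+$)")

def tokenize(line):
    tokens = [re.sub(PAT, "", t.lower()) for t in line.split()]
    return [t for t in tokens if t]

def cooccurrence_counts(lines, min_count=10):
    """Count unique ordered word pairs per line and keep the frequent ones.

    Hypothetical helper for illustration; keys are (word1, word2) tuples.
    """
    counts = Counter()
    for line in lines:
        # distinct words of this line, in first-occurrence order
        unique = dict.fromkeys(tokenize(line))
        # every ordered pair of distinct words counts once per line
        counts.update(permutations(unique, 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Made-up lines for illustration, with a low threshold so something survives
demo = cooccurrence_counts(
    ["Lady Macbeth, lady!", "Macbeth and Lady Macbeth"], min_count=2
)
print(demo)
```

Since a file object iterates line by line, the function could be called on the whole collection by opening `/content/shakespeare.txt` in a `with` block and passing the file handle as `lines`.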