# MapReduce Assignment

Acquire Data:

For this assignment, download the Harry Potter Books data from the following link (PDF is also attached):

https://ztcprep.com/library/story/Harry_Potter/Harry_Potter_(www.ztcprep.com).pdf

Extract Data:

Select the book that corresponds to your birth month. For birth months 8-12, divide the month by 2 and round up.
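
The selection rule can be written as a small function. This is a minimal sketch; the name `book_for_month` is hypothetical, and it assumes months 1-7 map directly to books 1-7 while months 8-12 are halved and rounded up.

```python
import math


def book_for_month(month: int) -> int:
    """Map a birth month (1-12) to a book number (1-7)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    # Assumption: months 1-7 map directly to books 1-7.
    # Months 8-12 are divided by 2 and rounded up, as the rule states.
    return month if month <= 7 else math.ceil(month / 2)


print(book_for_month(6))   # 6
print(book_for_month(11))  # 6
```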

Once you have selected the book, go to the page number that corresponds to your birth date (1-31) and extract the next 10 pages of the book to a text file (file1.txt).

Next, go to the page number that corresponds to your birth year (last 2 digits). For years from 2000 onwards, put a 1 in front of the two-digit year to get the page number (so 2000 becomes 100, 2001 becomes 101, and so on). Extract the next 10 pages into another text file (file2.txt).
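
The page arithmetic can be sketched as follows. The function names are hypothetical, and "next 10 pages" is read here as the ten pages that follow the start page, which is one possible interpretation of the instructions.

```python
def year_start_page(year: int) -> int:
    """Start page for a four-digit birth year, per the rule above."""
    last_two = year % 100
    # From 2000 onwards a 1 is put in front: 2000 -> 100, 2001 -> 101, ...
    return 100 + last_two if year >= 2000 else last_two


def next_ten_pages(start_page: int) -> range:
    """Assumed reading: the ten pages that follow the start page."""
    return range(start_page + 1, start_page + 11)


print(year_start_page(1998))                        # 98
print(year_start_page(2002))                        # 102
print(list(next_ten_pages(year_start_page(2002))))  # pages 103 to 112
```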



Write Code to analyze data:

1. Write Python code that uses MapReduce to count the occurrences of each word in the first text file (file1.txt). How many times is each word repeated?

2. From the second text file (file2.txt), write Python code that uses MapReduce to count how many times non-English words (names, places, spells, etc.) were used. List those words and how many times each was repeated.

There are multiple ways of doing this. You can use pyenchant (https://pypi.org/project/pyenchant/), pyspellchecker (https://pyspellchecker.readthedocs.io/en/latest/) or just download a list of words (http://www.gwicks.net/dictionaries.htm) and search through them.
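
As an illustration of the word-list option, the sketch below flags every token that is missing from a lookup set. The helper names and the tiny inline vocabulary are made up for the example; a real run would load a downloaded word list instead.

```python
from collections import Counter


def load_vocabulary(lines):
    """Build a lowercase lookup set from an iterable of dictionary lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def count_unknown_words(text, vocabulary):
    """Count tokens that are not in the vocabulary after basic cleaning."""
    counts = Counter()
    for token in text.split():
        # Lowercase and drop punctuation so "Lumos!" and "lumos" match
        word = "".join(ch for ch in token.lower() if ch.isalnum())
        if word and word not in vocabulary:
            counts[word] += 1
    return counts


# Tiny stand-in vocabulary, for illustration only
vocabulary = load_vocabulary(["the", "boy", "waved", "his", "wand", "and", "said"])
print(count_unknown_words("Harry waved his wand and said Lumos. Lumos!", vocabulary))
```

With a downloaded list, `load_vocabulary` can be given the open file object directly, since it only needs an iterable of lines. A set gives constant-time lookups, which matters when every token in the text is checked.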



### Extracting pages from the large PDF into `file1.txt` and `file2.txt`


My birth date is 06/05/2002 (mm/dd/yyyy), so I use book 6, `Harry Potter and the Half-Blood Prince` by J.K. Rowling, to create both `file1.txt` and `file2.txt`.

Note: Skip this step if you already have `file1.txt` and `file2.txt`.

In [15]:
!pip install pypdf



In [16]:
from pypdf import PdfReader


In [17]:
reader = PdfReader('input/Harry_Potter_(www.ztcprep.com).pdf')

In [18]:
# PDF page indices 4804:4822 hold the text of book pages 6-15, the 10 pages after my birth date (5)
file1_pages = reader.pages[4804:4822]
# PDF page indices 4957:4973 hold the text of book pages 103-112, the 10 pages after page 102 (birth year 2002)
file2_pages = reader.pages[4957:4973]

In [19]:
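def create_txt_from_pdf(pages, filename):
    """Write the extracted text of the given PDF pages to a text file."""
    # Explicit UTF-8 avoids platform-dependent encoding errors on curly quotes and dashes
    with open(filename, "w", encoding="utf-8") as file:
        file.write('\n'.join(page.extract_text() for page in pages))
    print(f"Created file: {filename}")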


In [20]:
create_txt_from_pdf(file1_pages, "input/file1.txt")
create_txt_from_pdf(file2_pages, "input/file2.txt")

Created file: input/file1.txt
Created file: input/file2.txt
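
The slice indices used above are specific to this PDF. One possible way to locate such an offset in code, offered as an assumption and not as the method used here, is to search the extracted page texts for a string that marks the start of the chosen book. The helper name is hypothetical.

```python
def first_page_containing(page_texts, needle):
    """Return the 0-based index of the first page whose text contains needle, or -1."""
    needle = needle.lower()
    for index, text in enumerate(page_texts):
        # Treat a missing page text as an empty string
        if needle in (text or "").lower():
            return index
    return -1


# Toy page texts standing in for [page.extract_text() for page in reader.pages]
pages = ["Contents", "Chapter One", "The Other Minister", "Spinner's End"]
print(first_page_containing(pages, "other minister"))  # 2
```

A title string may also appear in a table of contents or a running header, so the returned index is only a starting point to verify by eye.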


#### MapReduce Part-1

In [21]:
!pip install mrjob pyenchant



In [22]:
%%file utils/wordcount.py
from mrjob.job import MRJob


class WordCountJob(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every token, lowercased and stripped of punctuation
        for word in line.split():
            formatted_word = ''.join(letter for letter in word.lower() if letter.isalnum())
            if formatted_word:
                yield formatted_word, 1

    def reducer(self, word, counts):
        # Sum the 1s emitted for each distinct word
        yield word, sum(counts)


if __name__ == '__main__':
    WordCountJob.run()

Overwriting utils/wordcount.py


In [23]:
!python utils/wordcount.py input/file1.txt > output/wordcount_output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory C:\Users\devdp\AppData\Local\Temp\wordcount.devdp.20241003.205209.270540
Running step 1 of 1...
job output is in C:\Users\devdp\AppData\Local\Temp\wordcount.devdp.20241003.205209.270540\output
Streaming final output from C:\Users\devdp\AppData\Local\Temp\wordcount.devdp.20241003.205209.270540\output...
Removing temp directory C:\Users\devdp\AppData\Local\Temp\wordcount.devdp.20241003.205209.270540...
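
To make the three phases explicit, the sketch below reproduces the same word count in plain Python without mrjob. It is a simplified, single-process illustration of map, shuffle, and reduce; the function names are hypothetical and it does not reflect how mrjob is implemented internally.

```python
from collections import defaultdict


def map_line(line):
    """Map phase: emit (word, 1) for each cleaned token in a line."""
    for token in line.split():
        word = "".join(ch for ch in token.lower() if ch.isalnum())
        if word:
            yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce phase: sum the values collected for one key."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in sorted(grouped.items()))


print(word_count(["The boy who lived.", "The boy, the wand."]))
# {'boy': 2, 'lived': 1, 'the': 3, 'wand': 1, 'who': 1}
```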


#### MapReduce Part-2

In [24]:
!pip install pyenchant



In [25]:
%%file utils/non_eng_wordcount.py
from mrjob.job import MRJob
import enchant


class NonEngWordCount(MRJob):

    def mapper_init(self):
        # Load the dictionary once per mapper task instead of in __init__
        self.d = enchant.Dict("en_US")

    def mapper(self, _, line):
        # Emit (word, 1) for every cleaned token that the en_US dictionary rejects
        for word in line.split():
            formatted_word = ''.join(letter for letter in word.lower() if letter.isalnum())
            if formatted_word and not self.d.check(formatted_word):
                yield formatted_word, 1

    def reducer(self, word, counts):
        # Sum the 1s emitted for each distinct non-English word
        yield word, sum(counts)

if __name__ == '__main__':
    NonEngWordCount.run()

Overwriting utils/non_eng_wordcount.py


In [26]:
!python utils/non_eng_wordcount.py input/file2.txt > output/non_eng_wordcount_output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory C:\Users\devdp\AppData\Local\Temp\non_eng_wordcount.devdp.20241003.205213.676300
Running step 1 of 1...
job output is in C:\Users\devdp\AppData\Local\Temp\non_eng_wordcount.devdp.20241003.205213.676300\output
Streaming final output from C:\Users\devdp\AppData\Local\Temp\non_eng_wordcount.devdp.20241003.205213.676300\output...
Removing temp directory C:\Users\devdp\AppData\Local\Temp\non_eng_wordcount.devdp.20241003.205213.676300...
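
To list the words and their counts, the redirected output file can be read back and sorted. The sketch assumes mrjob's default output format, where each line holds a JSON-encoded key and value separated by a tab. The helper names are hypothetical and the sample lines are invented for illustration, not actual results.

```python
import json


def parse_mrjob_output(lines):
    """Parse lines of the form '"word"<TAB>count' into a dict."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split("\t", 1)
        counts[json.loads(key)] = json.loads(value)
    return counts


def top_words(counts, n=10):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Invented sample lines, not real output from this notebook
sample = ['"dumbledore"\t4\n', '"snape"\t7\n', '"accio"\t1\n']
print(top_words(parse_mrjob_output(sample), n=2))
# [('snape', 7), ('dumbledore', 4)]
```

For the real file, pass an open file handle for `output/non_eng_wordcount_output.txt` to `parse_mrjob_output`.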
