# MapReduce Assignment


### Extracting pages from the large PDF into `file1.txt` and `file2.txt`


Birthdate 07/14/2001 

Target book : `Harry Potter and the Deathly Hallows – J.K. Rowling`

Pages for file1 : `5966` to `5983` (pages 15 to 24 of the book)

Pages for file2 : `6104` to `6120` (pages 102 to 111 of the book)

In [1]:
!pip install pypdf



In [2]:
from pypdf import PdfReader

Creating a function to select the pages required for file1 and file2 from the whole PDF

In [3]:
def generate_files(pdf, f1_start, f1_end, f2_start, f2_end):
    reader = PdfReader(pdf)
    f1 = reader.pages[f1_start:f1_end] 
    f2 = reader.pages[f2_start:f2_end]
    return f1,f2


In [4]:
f1,f2 = generate_files('Harry_Potter_(www.ztcprep.com).pdf', 5966,5983,6104,6120)
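Python slices are zero-based and exclude the end index, so `reader.pages[5966:5983]` selects 17 pages starting at index 5966 (the 5967th page of the PDF). If the page numbers listed above are meant as 1-based and inclusive (an assumption; the notebook does not say), a small helper can do the conversion. The sketch below uses a plain list as a stand-in for `reader.pages`, and `page_range_to_slice` is a hypothetical name, not part of pypdf.

```python
def page_range_to_slice(first_page, last_page):
    """Convert a 1-based, inclusive page range into a 0-based Python slice."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Subtract 1 from the start; the end stays as-is because slices exclude it.
    return slice(first_page - 1, last_page)

# Stand-in for reader.pages: a list whose items are the page numbers 1..10.
pages = list(range(1, 11))
selected = pages[page_range_to_slice(3, 5)]
print(selected)  # pages 3, 4 and 5
```

The same slice object could presumably be passed to `reader.pages[...]` in `generate_files`, since that function already slices `reader.pages` with plain integer bounds.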

Creating file1 and file2 from obtained pages

In [5]:
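def generate_txt_files(mapping):
    # mapping: output filename -> sequence of pypdf page objects
    for filename, pages in mapping.items():
        # Write as UTF-8 so characters outside the platform default codec do not fail
        with open(filename, "w", encoding="utf-8") as file:
            file.write(' '.join(page.extract_text() for page in pages))
        print(f"Created file: {filename}")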

In [6]:
generate_txt_files({'file1.txt' : f1, 'file2.txt' : f2})

Created file: file1.txt
Created file: file2.txt


##### Now installing required libraries

In [7]:
!pip install mrjob pyenchant



#### PART 1 - Word count

Creating a MapReduce job to count occurrences of each word

In [8]:
%%file count_words.py
from mrjob.job import MRJob
import re

class CountWords(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every lowercase alphabetic token in the line.
        for word in self.tokenize(line):
            yield (word, 1)

    def reducer(self, word, counts):
        total_count = sum(counts)
        yield (word, total_count)

    def tokenize(self, line):
        return re.findall(r'\b[a-z]+\b', line.lower())

if __name__ == '__main__':
    CountWords.run()

Writing count_words.py
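To make the data flow of `CountWords` easier to follow, here is a minimal pure-Python sketch of the same map, shuffle, and reduce steps, without mrjob. The function names `map_phase`, `shuffle`, and `reduce_phase` and the two sample lines are illustrative assumptions; only the tokenizing regex is taken from the job above.

```python
import re
from collections import defaultdict

def tokenize(line):
    # Same pattern as CountWords.tokenize: runs of lowercase ASCII letters.
    return re.findall(r'\b[a-z]+\b', line.lower())

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every token.
    for line in lines:
        for word in tokenize(line):
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between steps.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the grouped counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

sample = ["The boy who lived", "the Boy lived"]  # made-up input lines
word_counts = reduce_phase(shuffle(map_phase(sample)))
print(word_counts)
```

In the real job, mrjob performs the grouping step itself and calls `reducer` once per distinct word.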


Running the job on `file1.txt`

In [9]:
!python count_words.py file1.txt > file1_output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory C:\Users\devdp\AppData\Local\Temp\count_words.devdp.20241003.205000.312914
Running step 1 of 1...
job output is in C:\Users\devdp\AppData\Local\Temp\count_words.devdp.20241003.205000.312914\output
Streaming final output from C:\Users\devdp\AppData\Local\Temp\count_words.devdp.20241003.205000.312914\output...
Removing temp directory C:\Users\devdp\AppData\Local\Temp\count_words.devdp.20241003.205000.312914...
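With mrjob's default output protocol, each line of the output file should hold a JSON-encoded key and a JSON-encoded value separated by a tab, for example `"harry"<TAB>12`. Assuming that format, the sketch below parses such lines and returns the most frequent words. The helper names and the sample lines are hypothetical and are not taken from the actual `file1_output.txt`.

```python
import json

def parse_mrjob_line(line):
    # Split on the first tab, then decode key and value from JSON.
    key, value = line.rstrip("\n").split("\t", 1)
    return json.loads(key), json.loads(value)

def top_words(lines, n=3):
    # Sort by descending count, breaking ties alphabetically.
    pairs = [parse_mrjob_line(line) for line in lines if line.strip()]
    return sorted(pairs, key=lambda kv: (-kv[1], kv[0]))[:n]

# Made-up sample in the assumed output format.
sample_output = ['"harry"\t12\n', '"the"\t40\n', '"wand"\t5\n', '"and"\t40\n']
print(top_words(sample_output, n=2))
```

For the real output, the open file object can be passed directly, for example `top_words(open('file1_output.txt'), n=10)`.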


#### PART 2 - Non English Word Frequency

Creating a MapReduce job to count only the words that are not found in the English (`en_US`) dictionary

In [10]:
%%file invalid_word_frequency_analyzer.py
from mrjob.job import MRJob
import re
import enchant

class InvalidWordFrequencyAnalyzer(MRJob):

    def __init__(self, *args, **kwargs):
        super(InvalidWordFrequencyAnalyzer, self).__init__(*args, **kwargs)
        self.english_dict = enchant.Dict("en_US")

    def mapper(self, _, line):
        words = self.tokenize(line.lower())
        for word in words:
            if self.is_valid_word(word):
                yield (word, 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

    def tokenize(self, text):
        return re.findall(r'\b\w+\b', text)

    def is_valid_word(self, word):
        # "Valid" here means worth counting: longer than one character and NOT in the en_US dictionary.
        return len(word) > 1 and not self.english_dict.check(word)

if __name__ == '__main__':
    InvalidWordFrequencyAnalyzer.run()

Writing invalid_word_frequency_analyzer.py
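The filter in this job combines the `\b\w+\b` tokenizer with a dictionary lookup. The sketch below reproduces that logic with a small hard-coded word set in place of `enchant.Dict("en_US")`, so it runs without pyenchant. `KNOWN_WORDS`, `is_non_english`, and the sample sentence are assumptions made for illustration. Note that `\w` also matches digits and underscores, and that an apostrophe splits a contraction into two tokens.

```python
import re

# Stand-in for the en_US dictionary; the real job uses enchant.Dict("en_US").
KNOWN_WORDS = {"harry", "looked", "at", "the", "don", "touch"}

def tokenize(text):
    # Same pattern as the job: runs of letters, digits, or underscores.
    return re.findall(r'\b\w+\b', text)

def is_non_english(word, known_words=KNOWN_WORDS):
    # Mirrors is_valid_word: keep words longer than one character that are not in the dictionary.
    return len(word) > 1 and word not in known_words

line = "Harry looked at the Horcrux, don't touch page_7"
tokens = tokenize(line.lower())
flagged = [word for word in tokens if is_non_english(word)]
print(tokens)   # "don't" is split into "don" and "t"
print(flagged)  # tokens that would be counted by the job
```

A stricter pattern such as the `[a-z]+` used in Part 1 would keep tokens like `page_7` out of the results, at the cost of dropping any word that contains a digit.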


Running the job on `file2.txt`

In [11]:
!python invalid_word_frequency_analyzer.py file2.txt > file2_output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory C:\Users\devdp\AppData\Local\Temp\invalid_word_frequency_analyzer.devdp.20241003.205002.338273
Running step 1 of 1...
job output is in C:\Users\devdp\AppData\Local\Temp\invalid_word_frequency_analyzer.devdp.20241003.205002.338273\output
Streaming final output from C:\Users\devdp\AppData\Local\Temp\invalid_word_frequency_analyzer.devdp.20241003.205002.338273\output...
Removing temp directory C:\Users\devdp\AppData\Local\Temp\invalid_word_frequency_analyzer.devdp.20241003.205002.338273...
