## 1. Write Python code and use MapReduce to count occurrences of each word in the first text file (file.txt). How many times is each word repeated?

ref_date = 08/12/2000 (mm/dd/yyyy)

In [4]:
%%file word_freq.py

# Importing necessary libraries
import re
from mrjob.job import MRJob

# Regular expression to match words (defined here but not used by the mapper below)
WORD_RE = re.compile(r"[\w']+")

# Defining a class for word frequency count
class MRWordFreqCount(MRJob):

    # Mapper function: splits each line on whitespace and yields (word, 1) for each token.
    # Punctuation stays attached, so "Frank" and "Frank," are counted as different words.
    def mapper(self, _, line):
        line = line.strip()  # Remove leading/trailing whitespace
        words = line.split()  # Split line on whitespace
        for word in words:
            yield word, 1  # Yield (word, 1) for each token

    # Reducer function: sums up the counts for each word
    def reducer(self, word, counts):
        yield word, sum(counts)  # Yield the word and its total count

if __name__ == '__main__':
    MRWordFreqCount.run()  # Run the MRJob


Overwriting word_freq.py


In [5]:
import word_freq  # Importing the word_freq module, which defines MRWordFreqCount

# Creating an instance of MRWordFreqCount with the input file path as argument
mr_job = word_freq.MRWordFreqCount(args=[r"C:\Users\bharg\Downloads\file1.txt"])

# Creating a runner for the MapReduce job
with mr_job.make_runner() as runner:
    runner.run()  # Running the MapReduce job
    # Parsing the output and printing word-frequency pairs
    for key, value in mr_job.parse_output(runner.cat_output()):
        print(key, value)  # Printing each word and its frequency


No configs specified for inline runner


12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
?” 1
A 3
All 2
And 9
As 3
Asleep 1
Aunt 2
Bertha 2
Boys. 1
Brutus’s 1
Bryce 1
But 4
Cannons 1
Center 1
Charms 1
Criminal 1
Dark 1
Do 1
Drive 2
Dudley 2
Dudley, 1
Dursleys 2
English 1
Even 1
Fire 10
Flying 1
For 1
Frank 19
Frank, 3
Frank’s 1
Goblet 10
Harry 40
Harry, 1
Harry; 1
Harry’s 3
He 19
Hedwig, 1
His 1
Hogwarts 3
Hogwarts, 1
Horrified, 1
However 1
I 17
I’ll 1
I’m 1
I’ve 1
If 2
In 1
Incurably 1
Inside 1
Instead 1
It 7
J.K. 10
Jorkins.” 1
Lord 2
Lord. 1
Lord? 1
Lord?” 1
Magic 1
Memory 1
Men 1
Ministry 2
Muggle 1
Muggle, 1
Muggle,” 2
Muggle?” 1
Muggles 1
My 3
Nagini, 1
Nagini. 1
No, 1
Now, 1
On 1
Only 1
Out 1
P 10
Peter, 1
Petunia, 2
Potter 12
Potter, 1
Privet 2
Quidditch 1
Rolls 1
Rowling 10
SCAR 1
Saturday 1
School 1
Secure 1
She 1
Slowly, 1
Something 1
St. 1
THE 1
The 14
Then 1
There 7
They 3
This 1
Though 1
Turn 1
Two 1
Uncle 2
Vernon, 2
Voldemort 8
Voldemort, 4
Voldemort. 2
Voldemort’s 1
Was 1
Well, 1
What 2
Where 1
Witchcraft 1
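
The output above lists `Frank` and `Frank,` (and likewise `Harry`, `Harry,` and `Harry;`) as separate keys, because the mapper splits on whitespace only. One possible way to merge such variants, assuming that lowercasing and dropping punctuation is acceptable for this exercise, is to tokenize with a regular expression before emitting pairs. The sketch below uses plain Python functions as stand-ins for the `mapper` and `reducer` methods of an `MRJob` subclass. The names `map_words` and `reduce_counts`, the extended pattern, and the sample lines are hypothetical and not taken from file1.txt.

```python
import re
from collections import defaultdict

# Matches runs of word characters plus straight or curly apostrophes
# (an assumed pattern; adjust it to the text being processed).
WORD_RE = re.compile(r"[\w'’]+")

def map_words(line):
    """Map step: yield (word, 1) for every regex token in the line, lowercased."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

def reduce_counts(pairs):
    """Shuffle and reduce step: group the pairs by word and sum the counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical sample lines, not taken from file1.txt
sample_lines = ["Frank, said Harry.", "Harry; Frank and Harry"]
pairs = [pair for line in sample_lines for pair in map_words(line)]
word_totals = reduce_counts(pairs)
print(word_totals)
```

With this tokenization, `Frank,` and `Frank` fall under the single key `frank`. The same `WORD_RE.findall(line)` loop could replace `line.split()` inside the `mapper` method, at the cost of changing the counts shown above.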

## 2. From the second text file (file2.txt), write Python code and use MapReduce to count how many times non-English words (names, places, spells, etc.) were used. List those words and how many times each was repeated.

There are multiple ways of doing this. You can use pyenchant (https://pypi.org/project/pyenchant/), pyspellchecker (https://pyspellchecker.readthedocs.io/en/latest/) or just download a list of words (http://www.gwicks.net/dictionaries.htm) and search through them.
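
As a rough illustration of the third option (searching a downloaded word list), the sketch below loads words into a set and keeps only the tokens that are missing from it. The tiny in-memory `ENGLISH_WORDS` set, the function name `non_dictionary_counts`, and the sample sentence are hypothetical stand-ins. In practice the set would be read from a dictionary file, assumed here to contain one word per line.

```python
import re
from collections import Counter

# Hypothetical stand-in for a downloaded word list
# (a real list would be read from a file, one lowercase word per line).
ENGLISH_WORDS = {"the", "boy", "went", "to", "and", "met", "at"}

def non_dictionary_counts(text, dictionary):
    """Count the tokens in `text` that do not appear in `dictionary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in dictionary)

# Hypothetical sample sentence, not taken from file2.txt
sample = "The boy went to Hogwarts and met Hagrid at Hogwarts"
unknown = non_dictionary_counts(sample, ENGLISH_WORDS)
print(unknown.most_common())
```

Storing the word list in a set keeps each membership test at roughly constant time, which matters when every token of a long text is checked.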



In [12]:
from spellchecker import SpellChecker  # Importing SpellChecker module for spell checking
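import re  # Regular expressions for tokenizing

# Tokenizer: lowercases the text and extracts runs of letters and curly apostrophes
def tokenize(message):
    message = message.lower()  # Make matching case-insensitive
    all_words = re.findall(r'[a-z’]+', message)  # Keep letters and ’ so contractions stay whole
    return all_words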

# Mapper function to yield each word with a count of 1
def mapper(document):
    for word in tokenize(document):
        yield word, 1

# Word count function: counts words that are missing from the spell checker's English dictionary
def word_count(documents):
    spell = SpellChecker()  # Create a SpellChecker
    english_words = spell.word_frequency.dictionary.keys()  # View of the known English words
    word_counts = {}  # Maps each non-dictionary word to its count
    for word, count in mapper(documents):
        if word not in english_words:  # Keep only words the dictionary does not know
            word_counts[word] = word_counts.get(word, 0) + count
    for word, count in word_counts.items():
        if count > 1:  # Print only words that occur more than once
            print(word, count)

# Open and read the specified file
with open(r"C:\Users\bharg\Downloads\file2.txt", encoding="utf-8") as file:
    file2 = file.read()  

word_count(file2)  # Call the word_count function 


ter 18
yeh 7
gettin’ 2
an’ 10
he’s 5
don’ 3
rowling 8
somethin’ 3
goin’ 2
hadn’t 4
hagrid 21
he’d 3
dudley 6
vernon 5
don’t 3
hogwarts 5
wasn’t 2
haven’t 2
he’ll 3
won’t 3
albus 2
dumbledore 3
didn’t 5
i’m 2
o’ 4
i’ll 2
hagrid’s 3
knuts 2
gringotts 2
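
The list above includes ordinary contractions such as `he’s` and `don’t`. A plausible cause, stated here as an assumption and not verified against the spell checker's data, is that the tokenizer keeps the curly apostrophe (’) while the dictionary stores these words with a straight one ('). The sketch below normalizes the apostrophe before the lookup. The `known_words` set is a hypothetical stand-in for the spell checker's dictionary, and `tokenize_normalized` and the sample text are illustrative names and data only.

```python
import re

def tokenize_normalized(message):
    """Lowercase the text, convert curly apostrophes to straight ones, and extract tokens."""
    message = message.lower().replace("’", "'")
    return re.findall(r"[a-z']+", message)

# Hypothetical stand-in for the spell checker's dictionary
known_words = {"he's", "don't", "a", "wizard", "say", "that"}

# Hypothetical sample text, not taken from file2.txt
sample = "He’s a wizard, Hagrid. Don’t say that, Hagrid"
flagged = {}
for word in tokenize_normalized(sample):
    if word not in known_words:
        flagged[word] = flagged.get(word, 0) + 1
print(flagged)
```

Dialect spellings such as `an’` or `goin’` would still be flagged after normalization, since they are not standard dictionary entries.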


### GitHub link: https://github.com/bhargav1228/603/import