# MapReduce

- Basics
    - programming model for *very* large datasets
    - parallel and distributed algorithms
    - cluster framework
    - also a way of thinking
- Background
    - originally developed by Google
    - Apache Hadoop is an open-source implementation written in Java
    - mrjob is a Python library for writing and running MapReduce jobs
        - will not run in IPython notebooks, must run in a script
- Process (all about key/value pairs)
    - mapping
        - each input record arrives as a single key/value pair, and the mapper emits new key/value pairs
        - the emitted pairs are then shuffled and sorted by key (this happens between mapping and reducing)
            - all values that share a key are grouped under that key
    - reducing
        - the values grouped under each key are reduced using a specified operation (sum, mean, etc.)
- mrjob (`from mrjob.job import MRJob`)
    - program the mapper and the reducer
    - create a class that inherits from `MRJob`
        - define a `mapper` method and a `reducer` method
        - call the class's `run()` method
        - see famous word count example below
```python
from mrjob.job import MRJob

class MRWordCount(MRJob):

    # emit each word (key) with a one (value); the input key is unused
    def mapper(self, _, line):
        # split() with no argument splits on any whitespace and skips empty strings
        for word in line.split():
            yield word.lower(), 1

    # aggregate the occurrences for each word by summing the ones
    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()
```
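
The map, shuffle, and reduce stages described under Process can be traced by hand without a cluster. This is a plain-Python sketch of the idea, not how mrjob or Hadoop work internally; `map_words`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

def map_words(line):
    """Map stage: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle stage: group all values under their key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_counts(word, occurrences):
    """Reduce stage: collapse the grouped values into one total."""
    return word, sum(occurrences)

def word_count(lines):
    """Run all three stages in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return [reduce_counts(word, ones) for word, ones in shuffle(pairs)]

if __name__ == '__main__':
    print(word_count(["the cat", "The dog"]))
    # [('cat', 1), ('dog', 1), ('the', 2)]
```

In a real job the framework handles the shuffle, so only the mapper and reducer need to be written.
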
- Running the script
    - `python pythonfile.py < inputfile.txt > outfile.txt`

- Anagram finder
```python
from mrjob.job import MRJob

class MRAnagram(MRJob):
    """Input must be cleaned text (all lowercase letters, no special characters) with one word per line."""
    
    def mapper(self, _, line):
        # sort the letters, then join them back into a string
        letters = ''.join(sorted(line))

        # key is the sorted letters, value is the original word
        yield letters, line
        
    def reducer(self, _, words):
        # collect the words that share these letters (words is a one-pass iterator)
        anagrams = list(words)
        
        # only yield answer for which there are two or more words with those letters
        if len(anagrams) > 1:
            yield len(anagrams), anagrams
            
if __name__ == '__main__':
    MRAnagram.run()
```
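
The anagram logic rests on one idea: words that are anagrams of each other share the same sorted letters, so the sorted letters work as a grouping key. A minimal plain-Python sketch for checking that idea on a small word list before running the job; `anagram_key` and `find_anagrams` are hypothetical helper names, not part of mrjob.

```python
from collections import defaultdict

def anagram_key(word):
    """Return the word's letters in sorted order, joined into a string."""
    return ''.join(sorted(word))

def find_anagrams(words):
    """Group words by sorted letters and keep groups with two or more words."""
    groups = defaultdict(list)
    for word in words:
        groups[anagram_key(word)].append(word)
    return [group for group in groups.values() if len(group) > 1]

if __name__ == '__main__':
    print(anagram_key('listen'))
    # eilnst
    print(find_anagrams(['listen', 'silent', 'enlist', 'google', 'tinsel']))
    # [['listen', 'silent', 'enlist', 'tinsel']]
```
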