# MapReduce Design Patterns

## A Job Posting Dataset

The sample dataset we will use (*data/job-data/job-data-2018-09-08-00-00-37.txt*) contains job postings from one of the US job search websites. Each row of the file is a JSON document representing one job posting record.

The example below shows a sample job posting from the data file. The sample record has been formatted with 4-space indentation. In the real file, each record is stored as a JSON document on a single row.
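Since each row of the file holds one complete JSON document, a record can be inspected by parsing a single line and re-serializing it with indentation. The sketch below assumes only the one-record-per-row layout described above; the helper name `format_first_record` is illustrative and not part of the original notebook.

```python
import json


def format_first_record(path, indent=4):
    """Return the first non-empty line of a JSON-per-row file, pretty-printed."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                # Each row is a complete JSON document, so one line parses on its own.
                return json.dumps(json.loads(line), indent=indent, ensure_ascii=False)
    return None  # the file contained no records
```

Calling `print(format_first_record(path))` with the path of the data file (relative to wherever the notebook runs) would display one posting in the indented form.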

## 1. Data Inj

In [118]:
%%file mr-test.py

import json

from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol, RawValueProtocol

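class MRWordFrequencyCount(MRJob):

    # Decode each input line as one JSON document; the input key is always None.
    INPUT_PROTOCOL = JSONValueProtocol
    # Write only the value of each output pair, as a raw line of text.
    OUTPUT_PROTOCOL = RawValueProtocol

    def mapper(self, _, value):
        # value is the decoded record; for a JSON object it is a dict,
        # so join() concatenates its keys (the field names), not its values.
        yield _, '|'.join(value)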

        
if __name__ == '__main__':
    MRWordFrequencyCount.run()

Overwriting mr-test.py


In [119]:
!python3 mr-test.py ../data/job-data/job-data-2018-09-08-00-00-37.txt --output-dir test-out

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/mr-test.hadoop.20180913.144414.839390
job output is in test-out
Removing temp directory /tmp/mr-test.hadoop.20180913.144414.839390...
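One detail of the mapper is easy to miss: `JSONValueProtocol` decodes each input line into a Python object, so for a JSON object `value` is a `dict`, and `'|'.join(value)` iterates over its keys rather than its values. The plain-Python sketch below mimics that step without mrjob. The record and its field names are made up for illustration and are not taken from the dataset.

```python
import json

# Hypothetical record; the field names are assumptions, not the real schema.
line = '{"title": "Data Engineer", "city": "Austin"}'

value = json.loads(line)          # what the mapper receives as `value`

keys_joined = '|'.join(value)     # iterating a dict yields its keys
# Values may not be strings, so convert them before joining.
values_joined = '|'.join(str(v) for v in value.values())

print(keys_joined)    # title|city
print(values_joined)  # Data Engineer|Austin
```

If the goal is to emit the field values rather than the field names, the second form is the one to adapt inside the mapper.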


## 1. Filtering

### 1.1 Simple Filtering

Question: 

### 1.2 Random Sampling
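The notebook does not include a solution for this pattern, so the following is only a minimal sketch of how random sampling is commonly expressed as a map-only step: each record is kept independently with a fixed probability. It is written as a plain generator rather than an mrjob job so that it runs on its own; `sample_mapper`, `rate`, and `rng` are illustrative names, not from the original.

```python
import random


def sample_mapper(records, rate, rng=None):
    """Yield each record independently with probability `rate` (0.0 to 1.0)."""
    rng = rng or random.Random()
    for record in records:
        # random() is uniform on [0.0, 1.0), so about rate * N records are kept.
        if rng.random() < rate:
            yield record


# Usage: a seeded generator makes the sample reproducible between runs.
rows = [{"id": i} for i in range(1000)]
sample = list(sample_mapper(rows, rate=0.1, rng=random.Random(42)))
print(len(sample))  # roughly 100; the exact count depends on the seed
```

In an mrjob job the same test would sit inside `mapper`, yielding the record only when the draw falls below the rate, and no reducer would be needed.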

In [95]:
class Solution(object):
    def isMatch(self, s, p):
        """
        :type s: str
        :type p: str
        :rtype: bool
        """
        # dp[i][j] is True when the pattern prefix p[:i] matches the string prefix s[:j]
        dp = [[False]*(len(s)+1) for _ in range(len(p)+1)]
        dp[0][0] = True
        for i in range(len(p)):
            if p[i] == '*':
                if i == 0:
                    continue  # a leading '*' has nothing to repeat
                # zero occurrences: drop the "x*" pair from the pattern
                dp[i+1][0] = dp[i-1][0]
                for j in range(len(s)):
                    # zero occurrences, or one more occurrence that consumes s[j]
                    dp[i+1][j+1] = dp[i-1][j+1] | (dp[i+1][j] & (p[i-1] in (s[j], '.')))
            else:
                for j in range(len(s)):
                    # a literal or '.' must match s[j] and extend a shorter match
                    dp[i+1][j+1] = dp[i][j] & (p[i] in (s[j], '.'))
        for x in dp:
            print(x)
        return dp[-1][-1]

In [96]:
soln = Solution()
s = 'aaa'
p = '.*'

res = soln.isMatch(s, p)
print(res)

[True, False, False, False]
[False, True, False, False]
[True, True, True, True]
True


In [94]:
any([1,1,1])

True

In [63]:
True & True

True

In [66]:
(1 | 1) & 0

0

In [61]:
%timeit 0 & 1

10.3 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
