# W261-4  Homework 5

## HW 5.0

What is a data warehouse? What is a Star schema? When is it used?

## HW 5.1

In the database world, what is 3NF? Does machine learning use data in 3NF? If so, why? 
In what form does ML consume data?
Why would one use log files that are denormalized?

## HW 5.2

Using MRJob, implement a hashside join (memory-backed map-side) for left, 
right and inner joins. Run your code on the data used in HW 4.4: (Recall HW 4.4: Find the most frequent visitor of each page using mrjob and the output of 4.2 (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.)

Justify which table you chose as the Left table in this hashside join.

Please report the number of rows resulting from:

(1) Left joining Table Left with Table Right
(2) Right joining Table Left with Table Right
(3) Inner joining Table Left with Table Right

In [45]:
# take care of Jupyter and MRJob weirdnesses
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


The instructions aren't completely clear about which input to use, so I decided to use the original transformed web visit data. I'm extracting the page definitions into a separate file so that they can be loaded more easily into an in-memory hash, and I've done the same for the customer definitions. My thinking is that the left table for a left join is the page definitions, while a logical right join uses the customer definitions as the left table. The web visit data is then joined to the appropriate table as it is streamed in. Otherwise we end up with a lot of code gymnastics that don't necessarily make sense when either of these tables fits in memory just fine.

After working through this, I'm leaving the note above in place, but I ended up using only the page table as the in-memory table because the stream already has the customer IDs joined to it in a pre-processing step.

In [2]:
# Extract just the page information from the web visit data
!grep '^A' ../../week04/HW4/anonymous-msweb.data >pagefacts.txt

# Extract just the customer information from the web visit data
!grep '^C' ../../week04/HW4/anonymous-msweb.data >customers.txt
!head pagefacts.txt
!head customers.txt

A,1287,1,"International AutoRoute","/autoroute"
A,1288,1,"library","/library"
A,1289,1,"Master Chef Product Information","/masterchef"
A,1297,1,"Central America","/centroam"
A,1215,1,"For Developers Only Info","/developer"
A,1279,1,"Multimedia Golf","/msgolf"
A,1239,1,"Microsoft Consulting","/msconsult"
A,1282,1,"home","/home"
A,1251,1,"Reference Support","/referencesupport"
A,1121,1,"Microsoft Magazine","/magazine"
C,"10001",10001
C,"10002",10002
C,"10003",10003
C,"10004",10004
C,"10005",10005
C,"10006",10006
C,"10007",10007
C,"10008",10008
C,"10009",10009
C,"10010",10010


**MRJob**

The page definitions are used as the in-memory hash because they form the smallest table. The input stream of transformed web visits already combines the customer ID with each page visit record, so a separate customer table is not required. 
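The join logic itself can be sketched outside of MRJob. The following is a minimal, framework-free illustration rather than the job used below. It assumes visit records look like `V,<page_id>,1,C,<cust_id>` (page ID in field 1, customer ID in field 4, as the mappers expect); the function names `load_pages` and `hash_join` and the two sample visits are hypothetical.

```python
def load_pages(lines):
    """Build the in-memory hash: page_id -> url (quotes kept as in the file)."""
    pages = {}
    for line in lines:
        tokens = line.strip().split(',')
        if tokens[0] == 'A':
            pages[tokens[1]] = tokens[4]
    return pages


def hash_join(pages, visits, how='inner'):
    """Join the in-memory page table (left) with streamed visits (right).

    Rows are (page_id, url, cust_id); None marks the side with no match.
    """
    seen = set()
    rows = []
    for record in visits:
        tokens = record.strip().split(',')
        if tokens[0] != 'V':
            continue
        page_id, cust_id = tokens[1], tokens[4]
        if page_id in pages:
            seen.add(page_id)
            rows.append((page_id, pages[page_id], cust_id))
        elif how == 'right':
            # a right join keeps visits whose page is not in the table
            rows.append((page_id, None, cust_id))
    if how == 'left':
        # a left join keeps pages that no visit matched
        rows.extend((p, pages[p], None) for p in sorted(set(pages) - seen))
    return rows


# two page rows taken from pagefacts.txt, plus two made-up visit records
sample_pages = load_pages([
    'A,1287,1,"International AutoRoute","/autoroute"',
    'A,1288,1,"library","/library"',
])
sample_visits = ['V,1287,1,C,10001', 'V,9999,1,C,10002']

for how in ('left', 'right', 'inner'):
    print(how, len(hash_join(sample_pages, sample_visits, how)))
```

Only the small table is held in memory; the visits are consumed one record at a time, which is what makes the approach suitable for a mapper.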

In [63]:
%%writefile hashsidejoin.py
from mrjob.job import MRJob 
from mrjob.step import MRStep 

# mapper hash joins of customers, pages, and visits by customers to those pages
# we have a page table and a stream of visits by customer to pages

class MRHashsideJoin(MRJob):
    def configure_options(self):
        super(MRHashsideJoin, self).configure_options()
        self.add_passthrough_option(
            '--left', type='string', default=None)
        self.add_passthrough_option(
            '--leftjoin', action='store_true', default=False, help="perform left join")       
        self.add_passthrough_option(
            '--rightjoin', action='store_true', default=False, help="perform right join")
        self.add_passthrough_option(
            '--innerjoin', action='store_true', default=False, help="perform inner join") 
        
    # generate a dictionary of pages and URLs for them
    def mapper_lefttable_init(self):
        self.pages = {}
        # load the specified page facts table as the left table
        with open(self.options.left, 'rU') as pagedata:
            for line in pagedata:
                tokens = line.split(',')
                # the entry is the page id => ( url : count )
                self.pages[tokens[1]] = ( tokens[4].strip('\n') , 0 )
        
    # left join of the page facts table (in memory) with the streamed customer visits
    def mapper_leftjoin(self, _, record):
        self.increment_counter('Execution Counts', 'mapper calls', 1)
        tokens = record.split(',')
        # emit a key = (page_id, url, cust_id) and value = 1
        # visits to pages missing from the left table are dropped in a left join
        if tokens[0] == 'V' and tokens[1] in self.pages:
            # increment the page counter in the left table
            # tokens[1] = page id, tokens[4] = cust id
            url, count = self.pages[tokens[1]]
            self.pages[tokens[1]] = (url, count + 1)
            yield (tokens[1], url, tokens[4]), 1
            
    def mapper_leftjoin_final(self):
        # emit any pages that have no counts to complete the left join output
        for page in self.pages:
            if self.pages[page][1] == 0:
                yield (page, self.pages[page][0], None), 1
            
    # right join of the page facts table (in memory) with the streamed customer visits
    def mapper_rightjoin(self, _, record):
        self.increment_counter('Execution Counts', 'mapper calls', 1)
        tokens = record.split(',')

        # emit a key = (page_id, url, cust_id) and value = 1
        # if a record refers to a page not in the table then the page url is None
        if tokens[0] == 'V':
            if tokens[1] in self.pages:
                yield (tokens[1], self.pages[tokens[1]][0], tokens[4]), 1
            else:
                yield (tokens[1], None, tokens[4]), 1


    # inner join: emit only the visits whose page is in the page facts table
    def mapper_innerjoin(self, _, record):
        self.increment_counter('Execution Counts', 'mapper calls', 1)
        tokens = record.split(',')
            
        # emit a key = (page_id, url, cust_id) and value = 1
        if tokens[0] == 'V':
            if tokens[1] in self.pages:
                yield (tokens[1], self.pages[tokens[1]][0], tokens[4]), 1
        
        
    # combine page visits by key where the key is (page_id, url, cust_id)
    def combiner(self, key, counts): 
        self.increment_counter('Execution Counts', 'combiner count', 1)
        # sum the counts seen so far for this key;
        # the key is (page_id, page_url, cust_id) so we're counting page views by customer
        yield key, sum(counts)
       
    # keep a list of the pages not seen for the left side join
    def reducer_init(self):
        self.unseen_pages = []
        self.seen_pages = {}
            
    # use a reducer to sort the output and also to constrain the pages not seen from multiple mappers
    def reducer(self, key, counts):
        self.increment_counter('Execution Counts', 'reducer count', 1)
        # if we haven't seen this page (the customer id is None) add it to the list
        if key[2] is None:
            self.unseen_pages.append((key[0], key[1]))
        else:
            # if this page is in our unseen pages list, remove it because we've seen it after all
            if (key[0],key[1]) in self.unseen_pages:
                self.unseen_pages.remove((key[0], key[1]))
            yield (key, sum(counts))
    
    # yield any pages remaining in the unseen pages list after all is said and done
    def reducer_final(self):
        for page, url in self.unseen_pages:
            yield (page, url, None), 0
      
    # define the execution steps for left join, right join, and inner join
    def steps(self):
        if self.options.leftjoin:
            return[MRStep(mapper_init=self.mapper_lefttable_init,
                          mapper=self.mapper_leftjoin,
                          mapper_final=self.mapper_leftjoin_final,
                          reducer_init=self.reducer_init,
                          reducer=self.reducer,
                          reducer_final=self.reducer_final)]
        elif self.options.rightjoin:
            return[MRStep(mapper_init=self.mapper_lefttable_init,
                          mapper=self.mapper_rightjoin,
                          reducer_init=self.reducer_init,
                          reducer=self.reducer,
                          reducer_final=self.reducer_final)]
        elif self.options.innerjoin:
            return[MRStep(mapper_init=self.mapper_lefttable_init,
                          mapper=self.mapper_innerjoin,
                          reducer_init=self.reducer_init,
                          reducer=self.reducer,
                          reducer_final=self.reducer_final)]
                       
    
if __name__ == '__main__': 
    MRHashsideJoin.run()

Overwriting hashsidejoin.py


**Driver**

The driver takes one of `--leftjoin`, `--rightjoin`, or `--innerjoin` as a command line parameter and passes it to the MRJob to perform the requested join.

In [70]:
%%writefile joindriver.py

from sys import argv
from hashsidejoin import MRHashsideJoin

if len(argv) > 1:
    join = argv[1]
else:
    join = '--leftjoin'
    
print join

datafile = '/Users/rcordell/Documents/MIDS/W261/week04/HW4/anonymous-msweb-transformed.data'
pagefacts = '/Users/rcordell/Documents/MIDS/W261/week05/HW5/pagefacts.txt'

def main(datafile, pagefile):
    
    mr_job = MRHashsideJoin(args=[datafile, '--left='+pagefile, join, '--strict-protocols'])

    with mr_job.make_runner() as runner:
        runner.run()

        for line in runner.stream_output():
            (page, url, customer), count =  mr_job.parse_output_line(line)
            print '{0:5s}{1:20s}{2:10s}{3:4d}'.format(page,url,customer,count)

if __name__ == '__main__':
    main(datafile, pagefacts)


Overwriting joindriver.py


The left join shows that some defined pages are never visited.
The right join shows no visits to pages that are undefined.

In [72]:
!python joindriver.py --leftjoin
#!python joindriver.py --rightjoin
#!python joindriver.py --innerjoin

--leftjoin
1000 "/regwiz"           10001        1
1000 "/regwiz"           10010        1
1000 "/regwiz"           10039        1
1000 "/regwiz"           10073        1
1000 "/regwiz"           10087        1
1000 "/regwiz"           10101        1
1000 "/regwiz"           10132        1
1000 "/regwiz"           10141        1
1000 "/regwiz"           10154        1
1000 "/regwiz"           10162        1
1000 "/regwiz"           10166        1
1000 "/regwiz"           10201        1
1000 "/regwiz"           10218        1
1000 "/regwiz"           10220        1
1000 "/regwiz"           10324        1
1000 "/regwiz"           10348        1
1000 "/regwiz"           10376        1
1000 "/regwiz"           10384        1
1000 "/regwiz"           10409        1
1000 "/regwiz"           10429        1
1000 "/regwiz"           10454        1
1000 "/regwiz"           10457        1
1000 "/regwiz"           10471        1
1000 "/regwiz"           10497        1
1000 "/regwiz"           1051

## HW 5.3  EDA of Google n-grams dataset

For the Google n-grams dataset, unit test and regression test your code using the 
first 10 lines of the following file:

googlebooks-eng-all-5gram-20090715-0-filtered.txt

Finally show your results on the Google n-grams dataset. 

In particular, this bucket contains (~200) files (10Meg each) in the format:

	(ngram) \t (count) \t (pages_count) \t (books_count)

Do some EDA on this dataset using mrjob, e.g., 

- Longest 5-gram (number of characters)
- Top 10 most frequent words (please use the count information), i.e., unigrams
- 20 Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency 
- Distribution of 5-gram sizes (character length).  E.g., count (using the count field) up how many times a 5-gram of 50 characters shows up. Plot the data graphically using a histogram.

HW 5.3.1 OPTIONAL Question:
- Plot the log-log plot of the frequency distribution of unigrams. Does it follow a power law distribution?

For more background see:
https://en.wikipedia.org/wiki/Log%E2%80%93log_plot
https://en.wikipedia.org/wiki/Power_law
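
One way to check the power-law question numerically is to fit a straight line to log(frequency) against log(rank): a power law appears as a roughly straight line whose slope is the negative exponent. The sketch below assumes a rank-frequency view of the unigram counts and uses synthetic frequencies only; `loglog_slope` is a hypothetical helper, not part of the assignment.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# synthetic frequencies proportional to 1/rank, so the slope should be close to -1
synthetic = [1000.0 / rank for rank in range(1, 101)]
print(loglog_slope(synthetic))
```

A slope that stays stable across most of the rank range is consistent with a power law, although a straight-looking log-log plot alone is not proof of one.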

**MRJob to calculate the longest 5-gram**

In [1052]:
%%writefile ngramlength.py
from mrjob.job import MRJob 
from mrjob.step import MRStep
from mrjob.conf import combine_dicts


class MRNgramLength(MRJob):
    def jobconf(self):
        orig_jobconf = super(MRNgramLength, self).jobconf()
        
        custom_jobconf = {
            'mapred.map.tasks' : 28,
            'mapred.reduce.tasks' : 28
        }

        return combine_dicts(orig_jobconf, custom_jobconf)    

    # Get the 5gram, split into unigrams and sum the letters
    def mapper_ngram_length(self, _, line):
        self.increment_counter('Execution Counts', 'mapper ngram_length', 1)
        ngrams = line.split()[:5]
        yield None, sum(len(word) for word in ngrams)

    # yield the max value of all the ngram letter counts
    def reducer_ngram_length(self, _, count):
        yield None, max(count)
                    
    # define the execution steps: one map/reduce pass over the 5-grams
    def steps(self):
        return[MRStep(mapper=self.mapper_ngram_length,
                      reducer=self.reducer_ngram_length)]  
        
if __name__ == '__main__': 
    MRNgramLength.run()

Overwriting ngramlength.py
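
The job above reports only the maximum letter count, with spaces excluded. As a point of comparison, here is a small local sketch that also keeps the n-gram text and counts spaces as characters. It assumes the tab-separated record layout given in the assignment; the two sample lines are made up, and `longest_ngram` is a hypothetical helper.

```python
def longest_ngram(lines):
    """Return (length, ngram) for the longest n-gram, counting spaces."""
    best = (0, None)
    for line in lines:
        # the n-gram is the first tab-separated field
        ngram = line.rstrip('\n').split('\t')[0]
        # on ties, max() keeps the earlier candidate
        best = max(best, (len(ngram), ngram), key=lambda pair: pair[0])
    return best


sample_lines = [
    'a b c d e\t10\t9\t3',
    'alpha beta gamma delta epsilon\t4\t4\t2',
]
print(longest_ngram(sample_lines))
```

In an MRJob version the same `(length, ngram)` pair could be emitted by the mapper and reduced with `max`, so the winning text survives alongside its length.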


**Execution of MRJob for longest 5-gram** 

Local test data

Set up configuration file

In [1066]:
%%writefile mrjob.conf
include: /Users/rcordell/.mrjob.conf
runners:
    hadoop:
        hadoop_home: '/usr/local/Cellar/hadoop/2.7.1/libexec'

    emr:
        ssh_tunnel_to_job_tracker : true
        ec2_instance_type : m1.medium
        num_ec2_instances : 9
        enable_emr_debugging: true

Overwriting mrjob.conf


In [1030]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python ngramlength.py --conf-path '/Users/rcordell/Documents/MIDS/W261/week05/HW5/mrjob.conf' \
  --strict-protocols \
  --quiet \
  -r hadoop < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/testdata.txt' 

16/02/15 19:12:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `/user/rcordell/mrjob': No such file or directory
null	29


In [1053]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!aws s3 rm s3://w261-rlc-hw5/mrjob_out --recursive
!python ngramlength.py --conf-path '/Users/rcordell/Documents/MIDS/W261/week05/HW5/mrjob.conf' \
  --output-dir='s3://w261-rlc-hw5/mrjob_out/' \
  --strict-protocols \
  -r emr s3://filtered-5grams/*.txt

16/02/15 20:55:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `/user/rcordell/mrjob': No such file or directory
Got unexpected keyword arguments: ssh_tunnel
inferring aws_region from scratch bucket's region (us-west-1)
using s3://mrjob-ff1bb0ea96bd6412/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramlength.rcordell.20160216.045539.260160
writing master bootstrap script to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramlength.rcordell.20160216.045539.260160/b.py
Copying non-input files into s3://mrjob-ff1bb0ea96bd6412/tmp/ngramlength.rcordell.20160216.045539.260160/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-UJOTVK0SVJO6
Created new job flow j-UJOTVK0SVJO6
Job launched 30.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 60.8s ago, status STARTIN

In [1044]:
!aws s3 cp s3://w261-rlc-hw5/mrjob_out/part-00002 output.txt
!cat output.txt | tail -20

download: s3://w261-rlc-hw5/mrjob_out/part-00002 to ./output.txt
"aristocrat"	1.7906976744186047
"Dock"	1.8028169014084507
"pilage"	1.8333333333333333
"traitorously"	1.8928571428571428
"Rumanian"	1.904320987654321
"theres"	1.9230769230769231
"unreachable"	1.9433962264150944
"apiece"	1.9607843137254901
"Kiowa"	2.0
"Pathology"	2.0213017751479292
"Phe"	2.0408163265306123
"Saving"	2.1129032258064515
"denatured"	2.1864406779661016
"Gynecological"	2.2481536189069424
"houseless"	2.274891774891775
"bust"	2.3493975903614457
"operand"	2.353448275862069
"Expiration"	2.510204081632653
"Honourable"	2.8927536231884057
"lak"	3.072289156626506


**MRJob to calculate Top 10 unigram (word) occurrences**

In [1059]:
%%writefile ngramtop10.py
from mrjob.job import MRJob 
from mrjob.step import MRStep
from mrjob.conf import combine_dicts
import heapq


class MRNgramTop10(MRJob):
    def jobconf(self):
        orig_jobconf = super(MRNgramTop10, self).jobconf()
        
        custom_jobconf = {
            'mapred.map.tasks' : 28,
            'mapred.reduce.tasks' : 28
        }

        return combine_dicts(orig_jobconf, custom_jobconf) 
    
    # Extract the unigrams from the 5-grams and yield for counting
    def mapper_unigram_count(self, _, ngram):
        unigrams = ngram.split()[:5]
        for unigram in unigrams:
            yield unigram, 1
    
    # pass thru mapper to get around a bug in MRJob
    def mapper_unigram_top10(self, key, value):
        yield key, value
    
    # Combiner for the unigram count
    def combiner_unigram_count(self, unigram, count):
        yield unigram, sum(count)

    # combine sums for each unigram and change the key, value to sort on count
    def reducer_unigram_count(self, unigram, count):
        yield None, (sum(count), unigram)
        
    # use a heapq sort to yield the top10 unigrams by count
    def reducer_unigram_top10(self, _, unigram_count_pairs):
        for count, unigram in heapq.nlargest(10, unigram_count_pairs):
            yield unigram, count
            
    # define the execution steps: count unigrams, then select the top 10
    def steps(self):
        return[MRStep(mapper=self.mapper_unigram_count,
                      combiner=self.combiner_unigram_count,
                      reducer=self.reducer_unigram_count),
               MRStep(mapper=self.mapper_unigram_top10,
                      reducer=self.reducer_unigram_top10)]
        
if __name__ == '__main__': 
    MRNgramTop10.run()

Overwriting ngramtop10.py
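
The mapper above emits 1 for every word, so each 5-gram record contributes equally regardless of its count field. The assignment asks for the count information to be used; a minimal local sketch of that weighting follows, assuming the tab-separated layout from the assignment. The sample lines are made up and `top_unigrams` is a hypothetical helper.

```python
import heapq
from collections import Counter


def top_unigrams(lines, n=10):
    """Sum the 5-gram count field per word and return the n largest."""
    totals = Counter()
    for line in lines:
        ngram, count = line.rstrip('\n').split('\t')[:2]
        for word in ngram.split(' '):
            # weight every word by how often its 5-gram occurred
            totals[word] += int(count)
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


sample_lines = [
    'a b a c d\t3\t3\t1',
    'b e f g h\t2\t2\t1',
]
print(top_unigrams(sample_lines, n=2))
```

In the MRJob, the equivalent change would be for the first mapper to yield the parsed count instead of 1.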


**Execution of MRJob to calculate Top 10 Unigram Frequencies**

In [1067]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python ngramtop10.py -c mrjob.conf \
  --strict-protocols \
  --quiet \
  -r hadoop < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/testdata.txt' 

16/02/15 21:51:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `/user/rcordell/mrjob': No such file or directory
"A"	20
"of"	12
"the"	7
"Study"	4
"and"	3
"Case"	3
"in"	2
"a"	2
"Framework"	2
"Conceptual"	2


In [1068]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!aws s3 rm s3://w261-rlc-hw5/mrjob_out --recursive
!python ngramtop10.py -c mrjob.conf \
  --output-dir='s3://w261-rlc-hw5/mrjob_out/' \
  --strict-protocols \
  -r emr s3://filtered-5grams/*.txt

16/02/15 21:55:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `/user/rcordell/mrjob': No such file or directory
delete: s3://w261-rlc-hw5/mrjob_out/_SUCCESS
delete: s3://w261-rlc-hw5/mrjob_out/part-00009     
delete: s3://w261-rlc-hw5/mrjob_out/part-00002     
delete: s3://w261-rlc-hw5/mrjob_out/part-00010     
delete: s3://w261-rlc-hw5/mrjob_out/part-00004     
delete: s3://w261-rlc-hw5/mrjob_out/part-00005     
delete: s3://w261-rlc-hw5/mrjob_out/part-00011     
delete: s3://w261-rlc-hw5/mrjob_out/part-00001     
delete: s3://w261-rlc-hw5/mrjob_out/part-00007     
delete: s3://w261-rlc-hw5/mrjob_out/part-00008     
delete: s3://w261-rlc-hw5/mrjob_out/part-00003      
delete: s3://w261-rlc-hw5/mrjob_out/part-00000      
delete: s3://w261-rlc-hw5/mrjob_out/part-00006      
delete: s3://w261-rlc-hw5/mrjob_out/part-00012      
delete: s3://w261-rlc-hw5/mrjob_out/part-00014      
delete: s3://w261-rl

**Word Density**

In [928]:
%%writefile ngramdensity.py
from mrjob.job import MRJob 
from mrjob.step import MRStep
from mrjob import conf
import re


class MRNgramDensity(MRJob):
    
    SORT_VALUES = True

    def configure_options(self):
        super(MRNgramDensity, self).configure_options()
        
    def jobconf(self):
        orig_jobconf = super(MRNgramDensity, self).jobconf()
        
        custom_jobconf = {
            'mapreduce.partition.keypartitioner.options': '-k2,2nr',
            'mapreduce.job.output.key.comparator.class' :
              'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapreduce.partition.keycomparator.options': '-k1 -k2nr'
        }

        return conf.combine_dicts(orig_jobconf, custom_jobconf)
        
    # Get each word of an ngram and emit the word, (count, pages count)
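    def mapper(self, _, line):
        self.increment_counter('Execution Counts', 'mapper', 1)
        # strip() returns a new string, so keep the result
        line = line.strip()
        try:
            [ngram,count,pages,books] = re.split("\t",line)
            count = int(count)
            pages = int(pages)
        except ValueError:
            # count malformed lines instead of printing them: anything a
            # mapper writes to stdout is treated as job output
            self.increment_counter('Errors', 'malformed line', 1)
            return
        words = re.split(" ",ngram)
        for word in words:
            yield word, (count, pages)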
            
    # combine intermediate counts
    def combiner(self, word, counts):
        self.increment_counter('Execution Counts', 'combiner', 1)            
        yield word, map(sum, zip(*counts))
    
    # reducer merge final counts and reverse the key,value
    def reducer(self, word, counts):
        self.increment_counter('Execution Counts', 'reducer', 1)
        yield None, (word, map(sum, zip(*counts)))
    
    def mapper_topn(self, _, values):
        self.increment_counter('Execution Counts', 'mapper_topn', 1)
        yield None, (values[0], float(values[1][0])/values[1][1])

    def mapper_topn2(self, _, values):
        self.increment_counter('Execution Counts', 'mapper_topn', 1)
        yield None, (float(values[1][0])/values[1][1], values[0])
        
    def reducer_topn(self, _, values):
        self.increment_counter('Execution Counts', 'reducer_topn', 1)
        for value in values:
            yield value[0], value[1]

    def reducer_topn2(self, _, values):
        self.increment_counter('Execution Counts', 'reducer_topn', 1)
        for value in values:
            yield value[1], value[0]
        
    # define the execution steps: aggregate per word, then emit densities
    def steps(self):
        return[MRStep(mapper=self.mapper,
                      combiner=self.combiner,
                      reducer=self.reducer),
               MRStep(mapper=self.mapper_topn2,
                      reducer=self.reducer_topn2)
              ]  
        
if __name__ == '__main__': 
    MRNgramDensity.run()

Overwriting ngramdensity.py
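
The density being computed is the summed count divided by the summed pages_count for each word. A small local sketch of the same arithmetic, useful for checking the job on a handful of lines, is shown here. It assumes the four-field tab-separated layout; the sample lines are made up and `word_density` is a hypothetical helper.

```python
from collections import defaultdict


def word_density(lines):
    """Return (density, word) pairs sorted from most to least dense."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 4:
            continue  # skip malformed records
        ngram, count, pages = fields[0], int(fields[1]), int(fields[2])
        for word in ngram.split(' '):
            totals[word][0] += count
            totals[word][1] += pages
    densities = [(c / float(p), w) for w, (c, p) in totals.items() if p]
    return sorted(densities, reverse=True)


sample_lines = [
    'x y\t6\t3\t1',
    'y z\t4\t4\t1',
]
ranked = word_density(sample_lines)
print(ranked[:20])   # most dense first
print(ranked[-20:])  # least dense last
```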


In [1054]:
%%writefile mrjob.conf
include: ~/.mrjob.conf
runners:
    hadoop:
        hadoop_home: '/usr/local/Cellar/hadoop/2.7.1/libexec'

    emr:
        ssh_tunnel_to_job_tracker : true
        ec2_instance_type : m1.medium
        num_ec2_instances : 6
        enable_emr_debugging: true

Overwriting mrjob.conf


In [916]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python ngramdensity.py --conf-path '/Users/rcordell/Documents/MIDS/W261/week05/HW5/mrjob.conf' \
  --strict-protocols \
  -r local < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/testdata.txt' 

16/02/14 21:09:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/14 21:09:29 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/rcordell/mrjob
No configs specified for local runner
ignoring partitioner keyword arg (requires real Hadoop): 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'
creating tmp directory /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramdensity.rcordell.20160215.050929.588950
writing wrapper script to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramdensity.rcordell.20160215.050929.588950/setup-wrapper.sh
Detected hadoop configuration property names that do not match hadoop version 0.20:
The have been translated as follows
 mapreduce.job.output.key.comparator.class: mapred.output.key.comparator.class
mapreduce.partition.keypartitioner.options: mapred.text.key.partitioner.options
ma

In [929]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python ngramdensity.py --conf-path '/Users/rcordell/Documents/MIDS/W261/week05/HW5/mrjob.conf' \
  --strict-protocols \
  --output-dir mrjob/out \
  --no-output \
  -r hadoop < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt' 

16/02/14 21:23:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/14 21:23:05 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/rcordell/mrjob
creating tmp directory /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramdensity.rcordell.20160215.052306.402981
writing wrapper script to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramdensity.rcordell.20160215.052306.402981/setup-wrapper.sh
reading from STDIN
Using Hadoop version 2.7.1
Copying local files into hdfs:///user/rcordell/tmp/mrjob/ngramdensity.rcordell.20160215.052306.402981/files/
Detected hadoop configuration property names that do not match hadoop version 2.7.1:
The have been translated as follows
 mapred.text.key.comparator.options: mapreduce.partition.keycomparator.options
mapred.output.key.comparator.class: mapreduce.job.output.key.comparator.class

In [930]:
!hdfs dfs -cat /user/rcordell/mrjob/out/part-00000 | tail -20

16/02/14 21:30:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
"aristocrat"	1.7906976744186047
"Dock"	1.8028169014084507
"pilage"	1.8333333333333333
"traitorously"	1.8928571428571428
"Rumanian"	1.904320987654321
"theres"	1.9230769230769231
"unreachable"	1.9433962264150944
"apiece"	1.9607843137254901
"Kiowa"	2.0
"Pathology"	2.021301775147929
"Phe"	2.0408163265306123
"Saving"	2.1129032258064515
"denatured"	2.1864406779661016
"Gynecological"	2.2481536189069424
"houseless"	2.274891774891775
"bust"	2.3493975903614457
"operand"	2.353448275862069
"Expiration"	2.510204081632653
"Honourable"	2.8927536231884057
"lak"	3.072289156626506


In [931]:
!aws s3 rm s3://w261-rlc-hw5/mrjob_out --recursive
!python ngramdensity.py --conf-path '/Users/rcordell/Documents/MIDS/W261/week05/HW5/mrjob.conf' \
  --strict-protocols \
  --output-dir='s3://w261-rlc-hw5/mrjob_out/' \
  --no-output \
  -r emr < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt' 

delete: s3://w261-rlc-hw5/mrjob_out/part-00000
delete: s3://w261-rlc-hw5/mrjob_out/part-00001   
delete: s3://w261-rlc-hw5/mrjob_out/part-00002   
delete: s3://w261-rlc-hw5/mrjob_out/_SUCCESS     
Got unexpected keyword arguments: ssh_tunnel
inferring aws_region from scratch bucket's region (us-west-1)
using s3://mrjob-ff1bb0ea96bd6412/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramdensity.rcordell.20160215.053108.996694
writing master bootstrap script to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramdensity.rcordell.20160215.053108.996694/b.py
reading from STDIN
Copying non-input files into s3://mrjob-ff1bb0ea96bd6412/tmp/ngramdensity.rcordell.20160215.053108.996694/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-16OJH2BBEQUCT
Created new job flow j-16OJH2BBEQUCT
Detected hadoop configuration property names that do not match hadoop version 1.0.3:
The have

In [None]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python ngramdensitydriver.py < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/testdata.txt' 

In [1011]:
%%writefile ngramlengthfreq.py
from mrjob.job import MRJob 
from mrjob.step import MRStep
from mrjob import conf
import re


class MRNgramLengthFreq(MRJob):
    
    SORT_VALUES = True
        
    def jobconf(self):
        orig_jobconf = super(MRNgramLengthFreq, self).jobconf()        
        custom_jobconf = {
            'stream.num.map.output.key.fields': 2,
            'mapred.output.key.comparator.class' :
              'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapred.text.key.comparator.options': '-k2n -k1nr'
        }
        return conf.combine_dicts(orig_jobconf, custom_jobconf)    
    
    # Get the character length of each 5-gram and emit (length, count)
    def mapper(self, _, line):
        self.increment_counter('Execution Counts', 'mapper', 1)
        # strip() returns a new string, so keep the result
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        count = int(count)
        # length in characters, including the spaces between words
        ngram = ngram.strip(" ")
        yield len(ngram), count
                
    # combine intermediate counts
    def combiner(self, ngram_length, count):
        self.increment_counter('Execution Counts', 'combiner', 1)            
        yield ngram_length, sum(count)
    
    # reducer merge final counts and reverse the key,value
    def reducer(self, ngram_length, count):
        self.increment_counter('Execution Counts', 'reducer', 1)
        yield None, (sum(count), ngram_length)
    
    # sorting reducer
    def reducer_sort(self, _, count_length_pair):
        for count, length in count_length_pair:
            yield length, count 
    
    
    
    # define the execution steps: sum counts per length, then sort
    def steps(self):
        return[MRStep(mapper=self.mapper,
                      combiner=self.combiner,
                      reducer=self.reducer),
               MRStep(reducer=self.reducer_sort)
              ]  
        
if __name__ == '__main__': 
    MRNgramLengthFreq.run()

Overwriting ngramlengthfreq.py


In [1055]:
%%writefile mrjob_length.conf
include: ~/.mrjob.conf
runners:
    hadoop:
        hadoop_home: '/usr/local/Cellar/hadoop/2.7.1/libexec'

    emr:
        ssh_tunnel_to_job_tracker : true
        ec2_instance_type : m1.large
        num_ec2_instances : 10
        enable_emr_debugging: true

Overwriting mrjob_length.conf


In [1047]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python ngramlengthfreq.py --conf-path '/Users/rcordell/Documents/MIDS/W261/week05/HW5/mrjob_length.conf' \
  --strict-protocols \
  -r local < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/testdata.txt' 

16/02/15 20:50:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `/user/rcordell/mrjob': No such file or directory
No configs specified for local runner
ignoring partitioner keyword arg (requires real Hadoop): 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'
creating tmp directory /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramlengthfreq.rcordell.20160216.045047.923487
writing wrapper script to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramlengthfreq.rcordell.20160216.045047.923487/setup-wrapper.sh
reading from STDIN
writing to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramlengthfreq.rcordell.20160216.045047.923487/step-0-mapper_part-00000
> sh -ex setup-wrapper.sh /Users/rcordell/Documents/MIDS/W261/W261env/bin/python ngramlengthfreq.py --step-num=0 --mapper --strict-protocols /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramlengthfreq.rcordell.20160216.045047.9234

In [1014]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python ngramlengthfreq.py --conf-path '/Users/rcordell/Documents/MIDS/W261/week05/HW5/mrjob_length.conf' \
  --strict-protocols \
  --output-dir mrjob/out \
  --no-output \
  -r hadoop < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/testdata.txt' 

16/02/15 00:32:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `/user/rcordell/mrjob': No such file or directory
creating tmp directory /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramlengthfreq.rcordell.20160215.083237.439512
writing wrapper script to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/ngramlengthfreq.rcordell.20160215.083237.439512/setup-wrapper.sh
reading from STDIN
Using Hadoop version 2.7.1
Copying local files into hdfs:///user/rcordell/tmp/mrjob/ngramlengthfreq.rcordell.20160215.083237.439512/files/
Detected hadoop configuration property names that do not match hadoop version 2.7.1:
The have been translated as follows
 mapred.text.key.partitioner.options: mapreduce.partition.keypartitioner.options
mapred.output.key.comparator.class: mapreduce.job.output.key.comparator.class
mapred.text.key.comparator.options: mapreduce.partition.keycomparator.options
HADOOP: Unable to load

In [1015]:
!hdfs dfs -cat /user/rcordell/mrjob/out/part-00000

16/02/15 00:33:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
28	1099
22	447
23	197
27	123
22	153
24	116
29	145
27	110
30	163
26	102
31	68
29	144
17	62
31	49
33	121
22	42
23	55


## HW 5.4  (over 2Gig of Data)

In this part of the assignment we will focus on developing methods
for detecting synonyms, using the Google 5-grams dataset. To accomplish
this you must script two main tasks using MRJob:

(1) Build stripes of word co-occurrence for the top 10,000 most frequently
appearing words across the entire set of 5-grams, using the words ranked 9,001-10,000 as a basis,
and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).

(2) Using two (symmetric) comparison methods of your choice 
(e.g., correlations, distances, similarities), pairwise compare 
all stripes (vectors), and output to a file in your bucket on s3.

==Design notes for (1)==
For this task you will be able to modify the pattern we used in HW 3.2
(feel free to use the solution as reference). To total the word counts 
across the 5-grams, output the support from the mappers using the total 
order inversion pattern:

<*word,count>

to ensure that the support arrives before the cooccurrences.

In addition to ensuring the determination of the total word counts,
the mapper must also output co-occurrence counts for the pairs of
words inside of each 5-gram. Treat these words as a basket,
as we have in HW 3, but count all stripes or pairs in both orders,
i.e., count both orderings: (word1,word2), and (word2,word1), to preserve
symmetry in our output for (2).
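
As a rough illustration of the mapper output described above, the sketch below expands one 5-gram record into support totals keyed as `*word` and stripes counted in both orders. The function name `emit_records`, the choice to de-duplicate words within a basket, and the in-memory sort standing in for the shuffle are all assumptions made for illustration; this is not the assignment's reference solution.

```python
from collections import Counter
from itertools import permutations


def emit_records(ngram, count):
    """Yield (key, value) pairs for one n-gram seen `count` times.

    Hypothetical helper: keys starting with '*' carry word support,
    all other keys carry a stripe of co-occurrence counts.
    """
    words = ngram.lower().split()
    # Support: one record per word occurrence, marked with '*'
    for word in words:
        yield "*" + word, count
    # Co-occurrence: treat the n-gram as a basket of distinct words
    # (an assumption). permutations() visits both (w1, w2) and (w2, w1),
    # so every pair is counted in both orders.
    stripes = {}
    for w1, w2 in permutations(sorted(set(words)), 2):
        stripes.setdefault(w1, Counter())[w2] += count
    for word, stripe in stripes.items():
        yield word, dict(stripe)


# Sorting by key stands in for the shuffle/sort phase
records = sorted(emit_records("a case study", 3), key=lambda kv: kv[0])
keys = [key for key, _ in records]
```

Because `*` sorts before any lowercase letter in ASCII, sorting by key places the support records ahead of the stripes, which is the ordering the pattern relies on.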

==Design notes for (2)==
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

- Jaccard
- Cosine similarity
- Spearman correlation
- Euclidean distance
- Taxicab (Manhattan) distance
- Shortest path graph distance (a graph, because our data is symmetric!)
- Pearson correlation
- Kendall correlation
...

However, be cautioned that some comparison methods are more difficult to
parallelize than others. Because your choice of association will be symmetric,
do not perform more comparisons than necessary.
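
To make the symmetry point concrete, here is a small sketch (plain Python, outside MapReduce) that compares stripes only once per unordered pair. The `jaccard` helper is an illustrative assumption, and the three toy stripes are copied from the sample stripe output that appears further down in this notebook.

```python
from itertools import combinations


def jaccard(stripe_a, stripe_b):
    """Jaccard similarity of the co-occurring word sets of two stripes."""
    a, b = set(stripe_a), set(stripe_b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


stripes = {
    "abbas": {"i": 1, "persia": 2, "of": 2},
    "abettor": {"and": 1, "of": 2, "supporter": 2},
    "ablaze": {"of": 1, "the": 2, "with": 2, "splendor": 2},
}

# combinations() yields each unordered pair once: n*(n-1)/2 comparisons
similarities = {
    (t1, t2): jaccard(stripes[t1], stripes[t2])
    for t1, t2 in combinations(sorted(stripes), 2)
}
```

With three stripes this performs three comparisons rather than the nine a full matrix would need; the mirrored entries can be filled in afterwards if a full matrix is wanted.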

Please use the inverted index (discussed in live session #5) based pattern to compute the pairwise (term-by-term) similarity matrix. 
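
The inverted-index pattern can be sketched in plain Python as two passes: first invert each stripe into postings keyed by co-occurring word, then emit a partial product for every pair of terms that share a posting list. This is only an in-memory illustration that assumes raw co-occurrence counts as weights; `invert` and `pairwise_dot_products` are hypothetical names and the stripes are made-up toy data.

```python
from collections import defaultdict
from itertools import combinations


def invert(stripes):
    """Turn {term: {coterm: weight}} into {coterm: {term: weight}}."""
    index = defaultdict(dict)
    for term, stripe in stripes.items():
        for coterm, weight in stripe.items():
            index[coterm][term] = weight
    return index


def pairwise_dot_products(index):
    """Sum weight products over every posting list two terms share."""
    totals = defaultdict(float)
    for postings in index.values():
        # Sorting fixes the order inside each pair, so (a, b) and (b, a)
        # always accumulate under the same key
        for t1, t2 in combinations(sorted(postings), 2):
            totals[(t1, t2)] += postings[t1] * postings[t2]
    return dict(totals)


stripes = {
    "x": {"of": 2, "the": 1},
    "y": {"of": 3, "and": 1},
    "z": {"the": 4},
}
dots = pairwise_dot_products(invert(stripes))
```

Pairs that never share a co-occurring word, such as `y` and `z` here, produce no output at all, which is where the pattern saves work compared with comparing every pair of stripes directly.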

#### MapReduce Job to find the top 10000 frequent words in our corpus of 5-grams ####

This is a slightly modified version of the previous MR job to find the top 10 most frequent unigrams from the 5-gram corpus. 

First we create a local mrjob.conf file to create an appropriate AWS EMR cluster on which to run our MRJob code

In [1076]:
%%writefile mrjob.conf
include: /Users/rcordell/.mrjob.conf
runners:
    hadoop:
        hadoop_home: '/usr/local/Cellar/hadoop/2.7.1/libexec'

    emr:
        ssh_tunnel_to_job_tracker : true
        ec2_instance_type : m3.xlarge
        num_ec2_instances : 8
        enable_emr_debugging: true

Overwriting mrjob.conf


**MR Job to find the top 10,000 most frequent unigrams (words)**

In [1093]:
%%writefile ngramtop10k.py
from mrjob.job import MRJob 
from mrjob.step import MRStep
from mrjob.conf import combine_dicts
import heapq


class MRNgramTop10K(MRJob):
    def jobconf(self):
        orig_jobconf = super(MRNgramTop10K, self).jobconf()
        
        custom_jobconf = {
#            'mapred.map.tasks' : 28,
#            'mapred.reduce.tasks' : 28
        }

        return combine_dicts(orig_jobconf, custom_jobconf) 
    
    # Extract the unigrams from the 5-grams and yield for counting
    def mapper_unigram_count(self, _, ngram):
        unigrams = ngram.split()[:5]
        for unigram in unigrams:
            yield unigram.lower(), 1
    
    # pass thru mapper to get around a bug in MRJob
    def mapper_unigram_top10k(self, key, value):
        yield key, value
    
    # Combiner for the unigram count
    def combiner_unigram_count(self, unigram, count):
        yield unigram, sum(count)

    # Sum the counts for each unigram and re-key every record to None so
    # that a single reducer in the next step sees all (count, unigram)
    # pairs and can select the top 10,000 with heapq
    def reducer_unigram_count(self, unigram, count):
        yield None, (sum(count), unigram)
        
    # use heapq.nlargest to yield the top 10,000 unigrams by count
    def reducer_unigram_top10k(self, _, unigram_count_pairs):
        for count, unigram in heapq.nlargest(10000, unigram_count_pairs):
            yield unigram, count
            
    # define the execution steps: count unigrams, then select the top 10,000
    def steps(self):
        return[MRStep(mapper=self.mapper_unigram_count,
                      combiner=self.combiner_unigram_count,
                      reducer=self.reducer_unigram_count),
               MRStep(mapper=self.mapper_unigram_top10k,
                      reducer=self.reducer_unigram_top10k)]
        
if __name__ == '__main__': 
    MRNgramTop10K.run()

Overwriting ngramtop10k.py
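
The final reducer relies on `heapq.nlargest` over `(count, unigram)` tuples. A minimal standalone sketch of that selection follows, using a handful of counts taken from the local test run below as sample data; the variable names are illustrative only.

```python
import heapq

# (count, unigram) pairs; tuples compare by count first, then by unigram
pairs = [(3, "and"), (22, "a"), (3, "case"), (12, "of"), (7, "the"), (4, "study")]

# Keep the 6 largest pairs, highest count first
top = heapq.nlargest(6, pairs)

# Swap back to (unigram, count), as the reducer does when it yields
ranked = [(unigram, count) for count, unigram in top]
```

Ties on count are broken by the word itself in descending order, which matches the test output, where `case` precedes `and` at a count of 3.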


**Execution of MRJob to calculate Top 10,000 Unigram Frequencies**

First test it on the local hadoop instance

In [1094]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python ngramtop10k.py -c mrjob.conf \
  --strict-protocols \
  --quiet \
  -r hadoop < '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/testdata.txt' 

16/02/16 21:54:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `/user/rcordell/mrjob': No such file or directory
"a"	22
"of"	12
"the"	7
"study"	4
"case"	3
"and"	3
"in"	2
"framework"	2
"for"	2
"conceptual"	2
"collection"	2
"wales"	1
"tales"	1
"sea"	1
"royal"	1
"review"	1
"religious"	1
"properties"	1
"on"	1
"narrative"	1
"limited"	1
"life"	1
"letters"	1
"juvenile"	1
"his"	1
"guide"	1
"government"	1
"george"	1
"general"	1
"forms"	1
"female"	1
"fairy"	1
"establishing"	1
"defence"	1
"critique"	1
"critical"	1
"continuation"	1
"concise"	1
"comparison"	1
"comparative"	1
"commentary"	1
"city"	1
"circumstantial"	1
"christmas"	1
"child's"	1
"by"	1
"biography"	1
"bill"	1
"bibliography"	1
"apology"	1


Now run it on the AWS EMR cluster against the entire corpus

In [1095]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!aws s3 rm s3://w261-rlc-hw5/mrjob_out --recursive
!python ngramtop10k.py -c mrjob.conf \
  --output-dir='s3://w261-rlc-hw5/mrjob_out/' \
  --no-output \
  --strict-protocols \
  -r emr s3://filtered-5grams/*.txt

16/02/16 21:56:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `/user/rcordell/mrjob': No such file or directory
delete: s3://w261-rlc-hw5/mrjob_out/_SUCCESS
delete: s3://w261-rlc-hw5/mrjob_out/part-00009     
delete: s3://w261-rlc-hw5/mrjob_out/part-00010     
delete: s3://w261-rlc-hw5/mrjob_out/part-00003     
delete: s3://w261-rlc-hw5/mrjob_out/part-00011     
delete: s3://w261-rlc-hw5/mrjob_out/part-00012     
delete: s3://w261-rlc-hw5/mrjob_out/part-00002     
delete: s3://w261-rlc-hw5/mrjob_out/part-00000     
delete: s3://w261-rlc-hw5/mrjob_out/part-00001     
delete: s3://w261-rlc-hw5/mrjob_out/part-00005     
delete: s3://w261-rlc-hw5/mrjob_out/part-00013      
delete: s3://w261-rlc-hw5/mrjob_out/part-00008      
delete: s3://w261-rlc-hw5/mrjob_out/part-00004      
delete: s3://w261-rlc-hw5/mrjob_out/part-00014      
delete: s3://w261-rlc-hw5/mrjob_out/part-00007      
delete: s3://w261-rl

Pull the result down locally, look at the top 10, and check the number of lines in the file.

In [1098]:
!aws s3 ls s3://w261-rlc-hw5/mrjob_out/

2016-02-16 22:13:58          0 _SUCCESS
2016-02-16 22:13:44     155908 part-00000
2016-02-16 22:13:40          0 part-00001
2016-02-16 22:13:36          0 part-00002
2016-02-16 22:13:41          0 part-00003
2016-02-16 22:13:41          0 part-00004
2016-02-16 22:13:37          0 part-00005
2016-02-16 22:13:41          0 part-00006
2016-02-16 22:13:43          0 part-00007
2016-02-16 22:13:39          0 part-00008
2016-02-16 22:13:40          0 part-00009
2016-02-16 22:13:39          0 part-00010
2016-02-16 22:13:40          0 part-00011
2016-02-16 22:13:40          0 part-00012
2016-02-16 22:13:39          0 part-00013
2016-02-16 22:13:46          0 part-00014
2016-02-16 22:13:46          0 part-00015
2016-02-16 22:13:49          0 part-00016
2016-02-16 22:13:49          0 part-00017
2016-02-16 22:13:49          0 part-00018
2016-02-16 22:13:49          0 part-00019
2016-02-16 22:13:50          0 part-00020
2016-02-16 22:13:49          0 part-00021
2016-02-16 22

In [1099]:
!aws s3 cp s3://w261-rlc-hw5/mrjob_out/part-00000 top10k.txt

download: s3://w261-rlc-hw5/mrjob_out/part-00000 to ./top10k.txt


In [1100]:
!cat top10k.txt | head -10
!cat top10k.txt | wc -l

"the"	27502442
"of"	18191779
"to"	12075971
"in"	7881239
"a"	7853465
"and"	7767900
"that"	4316884
"is"	3847383
"be"	3288731
"for"	2763613
cat: stdout: Broken pipe
   10000


Take the bottom 1,000 words of the top 10,000 by frequency (ranks 9,001-10,000) to use as our stripe terms, and look at the first and last 10 lines of the resulting file.

In [1101]:
!cat top10k.txt | tail -1000 > stripe_terms.txt
!cat stripe_terms.txt | wc -l
!cat stripe_terms.txt | head -10
!echo "------------------------"
!cat stripe_terms.txt | tail -10
!aws s3 cp stripe_terms.txt s3://w261-rlc-hw5/mrjob_in

    1000
"jane"	1450
"establishments"	1450
"summon"	1448
"commandments"	1448
"rains"	1447
"skins"	1446
"sack"	1446
"nationalist"	1446
"complementary"	1446
"atrial"	1446
------------------------
"rectangular"	1191
"projecting"	1191
"implements"	1191
"hindered"	1191
"emblem"	1191
"delusion"	1191
"weeping"	1190
"marshall"	1190
"logs"	1190
"heresy"	1190
upload: ./stripe_terms.txt to s3://w261-rlc-hw5/mrjob_in


In [5]:
!aws s3 cp stripe_terms.txt s3://w261-rlc-hw5/mrjob_in/

upload: ./stripe_terms.txt to s3://w261-rlc-hw5/mrjob_in/stripe_terms.txt


In [1]:
%%writefile stripes.py
from mrjob.job import MRJob 
from mrjob.step import MRStep
from mrjob.conf import combine_dicts
from itertools import combinations
import re

class MRStripes(MRJob):
    def configure_options(self):
        super(MRStripes, self).configure_options()
        self.add_file_option('--stripeterms')
        
    def jobconf(self):
        orig_jobconf = super(MRStripes, self).jobconf()
        
        custom_jobconf = {
            'mapred.map.tasks' : 20,
            'mapred.reduce.tasks' : 20
        }

        return combine_dicts(orig_jobconf, custom_jobconf) 
    
    # Load the stripe words into a set for constant-time membership tests.
    # The stripe word file is passed with --stripeterms and shipped to each task.
    def mapper_init(self):
        self.stripe_words = set()
        with open(self.options.stripeterms) as stripe_word_file:
            for line in stripe_word_file:
                # each line is: "word"<TAB>count; keep the word without quotes
                self.stripe_words.add(re.sub('"','',re.split('\t',line)[0]))
                
    
    # if any word in the n-gram is in the collection of stripe terms,
    # yield every 2-word combination from the n-gram once
    def mapper(self, _, line):
        self.increment_counter('Execution Counts', 'mapper', 1)
        # the n-gram is the first tab-separated field of the record
        ngram = re.split('\t', line.strip())[0]
        words = [term.lower() for term in re.split(' ', ngram)]

        # combinations() emits each pair in n-gram order only, so
        # (w1, w2) is yielded but (w2, w1) is not
        if any(word in self.stripe_words for word in words):
            for pair in combinations(words, 2):
                yield pair[0], {pair[1]: 1}
    
    # Combiner for the word pairs
    def combiner(self, term, coterms):
        stripes = {}
        stripes.setdefault(term,None)
        for occurance in coterms:
            for word, count in occurance.iteritems():
                if stripes[term] is not None:
                    if word in stripes[term]:
                        stripes[term][word] += count
                    else:
                        stripes[term][word] = count
                else:
                    stripes[term] = occurance
        for stripe in stripes:
            yield stripe, stripes[stripe]

    # Create the stripes
    def reducer(self, term, coterms):
        stripes = {}
        stripes.setdefault(term,None)
        for occurance in coterms:
            for word, count in occurance.iteritems():
                if stripes[term] is not None:
                    if word in stripes[term]:
                        stripes[term][word] += count
                    else:
                        stripes[term][word] = count
                else:
                    stripes[term] = occurance
        for stripe in stripes:
            yield stripe, stripes[stripe]
            
    # define the single execution step: build stripes from co-occurring pairs
    def steps(self):
        return[MRStep(mapper_init=self.mapper_init,
                      mapper=self.mapper,
                      combiner=self.combiner,
                      reducer=self.reducer)]
        
if __name__ == '__main__': 
    MRStripes.run()

Overwriting stripes.py
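
The combiner and reducer above both merge partial stripes by summing counts per co-occurring word. The same merge can be expressed compactly with `collections.Counter`. The sketch below is an in-memory illustration of that idea, not the code that produced the results in this notebook, and `merge_stripes` is a hypothetical name.

```python
from collections import Counter


def merge_stripes(partial_stripes):
    """Sum an iterable of {word: count} dicts into one stripe."""
    merged = Counter()
    for stripe in partial_stripes:
        # Counter.update() adds counts instead of replacing them
        merged.update(stripe)
    return dict(merged)


merged = merge_stripes([{"of": 1}, {"persia": 1}, {"of": 1, "i": 1}, {"persia": 1}])
```

Because summing counts is associative and commutative, merging partial results in any grouping gives the same stripe, which is why the same logic can serve as both combiner and reducer.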


Keep a local copy of mrjob.conf for this MR job

In [2]:
%%writefile mrjob.conf
include: /Users/rcordell/.mrjob.conf
runners:
    hadoop:
        hadoop_home: '/usr/local/Cellar/hadoop/2.7.1/libexec'

    emr:
        ssh_tunnel_to_job_tracker : true
        ec2_instance_type : m3.xlarge
        num_ec2_instances : 8
        enable_emr_debugging: true

Overwriting mrjob.conf


In [None]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python stripes.py -c mrjob.conf \
  --strict-protocols \
  -r hadoop \
  --stripeterms=stripe_terms.txt '/Users/rcordell/Documents/MIDS/W261/week05/HW5/filtered-5Grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt'

In [6]:
!aws s3 rm s3://w261-rlc-hw5/mrjob_out --recursive
!python stripes.py -c mrjob.conf \
  --output-dir='s3://w261-rlc-hw5/mrjob_out/' \
  --no-output \
  --strict-protocols \
  -r emr --stripeterms=s3://w261-rlc-hw5/mrjob_in/stripe_terms.txt  s3://filtered-5grams/*.txt

Got unexpected keyword arguments: ssh_tunnel
inferring aws_region from scratch bucket's region (us-west-1)
using s3://mrjob-ff1bb0ea96bd6412/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/0r/s2rd23cd1m7b10d2x3f_0zkm0000gn/T/stripes.rcordell.20160217.191709.605326
writing master bootstrap script to /var/folders/0r/s2rd23cd1m7b10d2x3f_0zkm0000gn/T/stripes.rcordell.20160217.191709.605326/b.py
Copying non-input files into s3://mrjob-ff1bb0ea96bd6412/tmp/stripes.rcordell.20160217.191709.605326/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-2YDJD8LUVXEG6
Created new job flow j-2YDJD8LUVXEG6
Job launched 30.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 60.4s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 90.6s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 120.9s ago, status STARTING: Configuring cluster software
Job launched 151.0s ago, status

In [7]:
!aws s3 ls s3://w261-rlc-hw5/mrjob_out/

2016-02-17 11:33:35          0 _SUCCESS
2016-02-17 11:33:18    1325907 part-00000
2016-02-17 11:33:30    1275835 part-00001
2016-02-17 11:33:20    1676719 part-00002
2016-02-17 11:33:18    1398173 part-00003
2016-02-17 11:33:29    1423169 part-00004
2016-02-17 11:33:18    1372071 part-00005
2016-02-17 11:33:19    1873503 part-00006
2016-02-17 11:33:18    1179524 part-00007
2016-02-17 11:33:18    1216220 part-00008
2016-02-17 11:33:17    1259077 part-00009
2016-02-17 11:33:18    2071448 part-00010
2016-02-17 11:33:18    1351068 part-00011
2016-02-17 11:33:16    1214820 part-00012
2016-02-17 11:33:16    1254734 part-00013
2016-02-17 11:33:29    1383473 part-00014
2016-02-17 11:33:33    1374188 part-00015
2016-02-17 11:33:26    1779121 part-00016
2016-02-17 11:33:30    1100636 part-00017
2016-02-17 11:33:41    1254305 part-00018
2016-02-17 11:33:42    1518639 part-00019


In [13]:
!for i in {0..9}; do aws s3 cp s3://w261-rlc-hw5/mrjob_out/part-0000$i stripes$i.txt; done
!for i in {0..9}; do aws s3 cp s3://w261-rlc-hw5/mrjob_out/part-0001$i stripes1$i.txt; done

download: s3://w261-rlc-hw5/mrjob_out/part-00000 to ./stripes0.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00001 to ./stripes1.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00002 to ./stripes2.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00003 to ./stripes3.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00004 to ./stripes4.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00005 to ./stripes5.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00006 to ./stripes6.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00007 to ./stripes7.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00008 to ./stripes8.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00009 to ./stripes9.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00010 to ./stripes10.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00011 to ./stripes11.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00012 to ./stripes12.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00013 to ./stripes13.txt
download: s3://w261-rlc-hw5/mrjob_out/part-00014 to ./stri

In [18]:
!rm stripes.txt
!cat stripes*.txt >>stripes.txt
!cat stripes.txt | wc -l
!head -10 stripes.txt

   50567
"aaronic"	{"and": 2, "the": 1, "priesthood": 5, "its": 1}
"abbas"	{"i": 1, "persia": 2, "of": 2}
"abdullah"	{"tariki": 1, "of": 6, "had": 1, "ii": 1, "jordan": 5, "saudi": 2, "arabia": 2}
"abettor"	{"and": 1, "of": 2, "supporter": 2}
"ablaze"	{"of": 1, "the": 2, "with": 2, "splendor": 2}
"abolition"	{"and": 2, "all": 2, "licensed": 2, "in": 2, "religious": 1, "oaths": 1, "was": 1, "pecuniary": 1, "preparatory": 1, "delaware": 1, "system": 1, "aristocratic": 2, "toll": 1, "department": 1, "restraints": 1, "rite": 1, "plural": 3, "priesthood": 1, "lecturer": 2, "inquisition": 3, "asserts": 1, "that": 1, "who": 1, "unnecessary": 1, "restrictive": 5, "traffic": 1, "licence": 1, "petitions": 2, "law": 1, "subscription": 2, "a": 2, "opium": 1, "dictatorship": 2, "sanctions": 1, "of": 53, "tribal": 1, "slavery": 4, "monopolies": 1, "while": 1, "converted": 1, "qualification": 1, "act": 1, "the": 22, "voting": 1, "penal": 1}
"abound"	{"and": 1, "practical": 1, "chronicle": 1, "strikin

In [19]:
!aws s3 cp stripes.txt s3://w261-rlc-hw5/mrjob_in/

upload: ./stripes.txt to s3://w261-rlc-hw5/mrjob_in/stripes.txt


In [3]:
!cat stripes.txt | head -10 > inversiontest.txt

cat: stdout: Broken pipe


In [5]:
!cat inversiontest.txt

"aaronic"	{"and": 2, "the": 1, "priesthood": 5, "its": 1}
"abbas"	{"i": 1, "persia": 2, "of": 2}
"abdullah"	{"tariki": 1, "of": 6, "had": 1, "ii": 1, "jordan": 5, "saudi": 2, "arabia": 2}
"abettor"	{"and": 1, "of": 2, "supporter": 2}
"ablaze"	{"of": 1, "the": 2, "with": 2, "splendor": 2}
"abolition"	{"and": 2, "all": 2, "licensed": 2, "in": 2, "religious": 1, "oaths": 1, "was": 1, "pecuniary": 1, "preparatory": 1, "delaware": 1, "system": 1, "aristocratic": 2, "toll": 1, "department": 1, "restraints": 1, "rite": 1, "plural": 3, "priesthood": 1, "lecturer": 2, "inquisition": 3, "asserts": 1, "that": 1, "who": 1, "unnecessary": 1, "restrictive": 5, "traffic": 1, "licence": 1, "petitions": 2, "law": 1, "subscription": 2, "a": 2, "opium": 1, "dictatorship": 2, "sanctions": 1, "of": 53, "tribal": 1, "slavery": 4, "monopolies": 1, "while": 1, "converted": 1, "qualification": 1, "act": 1, "the": 22, "voting": 1, "penal": 1}
"abound"	{"and": 1, "practical": 1, "chronicle": 1, "striking":

In [103]:
%%writefile cosine.py
from mrjob.job import MRJob 
from mrjob.step import MRStep
from mrjob.conf import combine_dicts
from itertools import combinations
from math import sqrt
import json
import re

class MRcosine(MRJob):        
   
    # take apart each stripe, count its coterms, and compute the weight
    # 1/sqrt(number of coterms); emit each coterm with its "document"
    # (the stripe term) and that weight
    def mapper(self, _, line):
        self.increment_counter('Execution Counts', 'mapper', 1)
        term, coterms = line.strip().split('\t')
        # the stripes file is JSON, so parse it once rather than calling eval()
        stripe = json.loads(coterms)
        # normalization weight for a binary vector with len(stripe) entries
        norm_length = 1.0/sqrt(len(stripe))
        for coterm in stripe:
            yield re.sub('"','',coterm), {re.sub('"','',term) : norm_length}
            
    # Partition on terms and collect the "documents"
    # yield the inversion of the documents and terms
    def reducer(self, term, docs):
        self.increment_counter('Execution Counts', 'reducer', 1)
        postings = {}
        # iterate through the document dictionaries
        for doc in docs:
            # merge each {document: weight} entry into the postings
            for item in doc:
                if item in postings:
                    postings[item] += doc[item]
                else:
                    postings[item] = doc[item]
        yield term, postings

    # for each posting list yield every pair-wise combination of its documents
    # as a tuple key, with the product of their weights as the value
    def pairwise_cosine_mapper(self, word, postings):
        self.increment_counter('Execution Counts', 'pairwise_cosine_mapper', 1)
        # sort so a given pair of documents is always emitted in the same
        # order, regardless of dictionary iteration order
        postings_list = sorted(postings.items())
        for doc_pair in combinations(postings_list,2):
            yield (doc_pair[0][0],doc_pair[1][0]),(doc_pair[0][1]*doc_pair[1][1])

    # reducer produces the total pairwise matrix values
    def pairwise_cosine_reducer(self, doc_pair, weights):
        yield doc_pair, sum(weights)
        
    # define the execution steps: build the inverted index, then
    # accumulate the pairwise products of weights
    def steps(self):
        return[MRStep(mapper=self.mapper,
                      reducer=self.reducer),
               MRStep(mapper=self.pairwise_cosine_mapper,
                      reducer=self.pairwise_cosine_reducer)]
        
if __name__ == '__main__': 
    MRcosine.run()

Overwriting cosine.py
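
Since each coterm is weighted by 1/sqrt(n), where n is the number of distinct coterms in a stripe, the summed products for a pair of terms equal the cosine similarity of their binary (presence/absence) vectors: |A ∩ B| / sqrt(|A| * |B|). The sketch below checks that identity on two stripes from `inversiontest.txt`; `binary_cosine` is an illustrative helper, not part of the job.

```python
from math import sqrt


def binary_cosine(stripe_a, stripe_b):
    """Cosine similarity treating each stripe as a 0/1 vector of coterms."""
    shared = set(stripe_a) & set(stripe_b)
    return len(shared) / sqrt(len(stripe_a) * len(stripe_b))


abbas = {"i": 1, "persia": 2, "of": 2}
abettor = {"and": 1, "of": 2, "supporter": 2}

# What the job accumulates: one product of weights per shared coterm
weight_abbas = 1.0 / sqrt(len(abbas))
weight_abettor = 1.0 / sqrt(len(abettor))
accumulated = sum(weight_abbas * weight_abettor
                  for coterm in abbas if coterm in abettor)

direct = binary_cosine(abbas, abettor)
```

Note that this measure ignores the co-occurrence counts themselves; a count-weighted cosine would need the counts carried through the postings as well.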


In [104]:
!hdfs dfs -rm -r /user/rcordell/mrjob
!python cosine.py -c mrjob.conf \
  --strict-protocols \
  -r local 'inversiontest.txt'

16/02/17 17:33:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: Call From localhost/127.0.0.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
No configs specified for local runner
creating tmp directory /var/folders/0r/s2rd23cd1m7b10d2x3f_0zkm0000gn/T/cosine.rcordell.20160218.013328.257037
writing wrapper script to /var/folders/0r/s2rd23cd1m7b10d2x3f_0zkm0000gn/T/cosine.rcordell.20160218.013328.257037/setup-wrapper.sh
writing to /var/folders/0r/s2rd23cd1m7b10d2x3f_0zkm0000gn/T/cosine.rcordell.20160218.013328.257037/step-0-mapper_part-00000
> sh -ex setup-wrapper.sh /Users/rcordell/Documents/MIDS/W261/W261env/bin/python cosine.py --step-num=0 --mapper --strict-protocols /var/folders/0r/s2rd23cd1m7b10d2x3f_0zkm0000gn/T/cosine.rcordell.20160218.013328.257037/input_part-00000 > /var/fo

In [105]:
%%writefile mrjob.conf
include: /Users/rcordell/.mrjob.conf
runners:
    hadoop:
        hadoop_home: '/usr/local/Cellar/hadoop/2.7.1/libexec'

    emr:
        ssh_tunnel_to_job_tracker : true
        ec2_instance_type : m3.xlarge
        num_ec2_instances : 8
        enable_emr_debugging: true

Overwriting mrjob.conf


In [106]:
!aws s3 rm s3://w261-rlc-hw5/mrjob_out --recursive
!python cosine.py -c mrjob.conf \
  --output-dir='s3://w261-rlc-hw5/mrjob_out/' \
  --no-output \
  --strict-protocols \
  -r emr s3://w261-rlc-hw5/mrjob_in/stripes.txt

delete: s3://w261-rlc-hw5/mrjob_out/part-00005
delete: s3://w261-rlc-hw5/mrjob_out/part-00008     
delete: s3://w261-rlc-hw5/mrjob_out/part-00000     
delete: s3://w261-rlc-hw5/mrjob_out/part-00002     
delete: s3://w261-rlc-hw5/mrjob_out/part-00004     
delete: s3://w261-rlc-hw5/mrjob_out/part-00009     
delete: s3://w261-rlc-hw5/mrjob_out/part-00012     
delete: s3://w261-rlc-hw5/mrjob_out/part-00010     
delete: s3://w261-rlc-hw5/mrjob_out/part-00013     
delete: s3://w261-rlc-hw5/mrjob_out/part-00011     
delete: s3://w261-rlc-hw5/mrjob_out/part-00014      
delete: s3://w261-rlc-hw5/mrjob_out/part-00017      
delete: s3://w261-rlc-hw5/mrjob_out/part-00015     
delete: s3://w261-rlc-hw5/mrjob_out/part-00016     
delete: s3://w261-rlc-hw5/mrjob_out/_SUCCESS       
delete: s3://w261-rlc-hw5/mrjob_out/part-00018     
delete: s3://w261-rlc-hw5/mrjob_out/part-00003     
delete: s3://w261-rlc-hw5/mrjob_out/part-00006     
delete: s3://w261-rlc-hw5/mrjob_out/part-00001     
delete: s3://w2

## HW 5.5

In this part of the assignment you will evaluate the success of your synonym detector.
Take the top 1,000 closest/most similar/correlative pairs of words as determined
by your measure in (2), and use the synonyms function in the accompanying
python code:

nltk_synonyms.py

Note: This will require installing the python nltk package:

http://www.nltk.org/install.html

and downloading its data with nltk.download().

For each (word1,word2) pair, check to see if word1 is in the list, 
synonyms(word2), and vice-versa. If one of the two is a synonym of the other, 
then consider this pair a 'hit', and then report the precision, recall, and F1 measure  of 
your detector across your 1,000 best guesses. Report the macro averages of these measures.
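
A minimal sketch of the hit test and the three measures is given below. The contents of `nltk_synonyms.py` are not shown here, so `toy_synonyms` is a stand-in that assumes `synonyms(word)` returns a list of words; the toy guesses and the figure of 4 discoverable synonym pairs used for recall are likewise made up for illustration.

```python
def is_hit(word1, word2, synonyms):
    """True if either word appears in the other's synonym list."""
    return word1 in synonyms(word2) or word2 in synonyms(word1)


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard definitions; a measure is 0.0 when its denominator is 0."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / float(predicted) if predicted else 0.0
    recall = true_pos / float(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Stand-in for the synonyms() function in nltk_synonyms.py (assumption)
TOY_SYNONYMS = {"big": ["large", "huge"], "large": ["big"], "cold": ["chilly"]}


def toy_synonyms(word):
    return TOY_SYNONYMS.get(word, [])


guesses = [("big", "large"), ("cold", "hot"), ("huge", "big"), ("sea", "tales")]
hits = sum(is_hit(w1, w2, toy_synonyms) for w1, w2 in guesses)

# Assumption for illustration: 4 true synonym pairs could have been found
precision, recall, f1 = precision_recall_f1(hits, len(guesses) - hits, 4 - hits)
```

Precision only needs the list of guesses, but recall depends on how the set of true synonym pairs is defined, so that choice should be stated alongside any reported numbers.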

## HW 5.5.1 (optional)
There is also a corpus of stopwords, that is, high-frequency words like "the", "to" and "also" that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts. Python's nltk comes with a prebuilt list of stopwords (see below). Using this stopword list, filter these tokens out of your analysis, rerun the experiments in 5.5, and discuss how the results with a stopword list compare to those without one.

    >>> from nltk.corpus import stopwords
    >>> stopwords.words('english')
    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
    'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
    'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
    'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
    'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
    'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
    'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
    'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
    'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
    'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
    'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
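
One way to apply the list is to filter each stripe before the similarity step. The sketch below hard-codes a small subset of the stopwords shown above so that it runs without the nltk data download; `filter_stripe` is a hypothetical helper, and dropping a stripe whose own term is a stopword is an assumption about how the filter should behave.

```python
# Small subset of the NLTK English stopword list shown above
STOPWORDS = frozenset(["i", "the", "of", "and", "a", "with", "its", "had"])


def filter_stripe(term, stripe, stopwords=STOPWORDS):
    """Drop stopword coterms; return None if the term itself is a stopword."""
    if term in stopwords:
        return None
    return {word: count for word, count in stripe.items()
            if word not in stopwords}


filtered = filter_stripe("ablaze", {"of": 1, "the": 2, "with": 2, "splendor": 2})
dropped = filter_stripe("the", {"sea": 3})
```

In a full run, `STOPWORDS` would be built from `stopwords.words('english')` instead of the hard-coded subset.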

## HW 5.6 (optional)

There are many good ways to build our synonym detectors, so for optional homework, 
measure co-occurrence by (left/right/all) consecutive words only, 
or make stripes according to word co-occurrences with the accompanying 
2-, 3-, or 4-grams (note here that your output will no longer 
be interpretable as a network) inside of the 5-grams.
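
For the consecutive-words variant, pair generation could look like the following sketch. The function name `adjacent_pairs` and its `direction` parameter are illustrative assumptions, not part of the assignment.

```python
def adjacent_pairs(ngram, direction="all"):
    """Yield (word, neighbour) pairs for consecutive words in an n-gram.

    direction: "right" pairs each word with its successor, "left" pairs
    each word with its predecessor, and "all" yields both.
    """
    words = ngram.lower().split()
    # zip() walks the n-gram as overlapping (previous, next) windows
    for left, right in zip(words, words[1:]):
        if direction in ("right", "all"):
            yield left, right
        if direction in ("left", "all"):
            yield right, left


right_pairs = list(adjacent_pairs("A Case Study of Female", "right"))
all_pairs = list(adjacent_pairs("A Case Study of Female"))
```

A 5-gram yields 4 pairs in one direction and 8 in both, compared with 10 unordered pairs when every word in the basket is paired with every other.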

## HW 5.7 (optional)

Once again, benchmark your top 10,000 associations (as in 5.5), this time for your
results from 5.6. Has your detector improved?
