# MIDS W261 Machine Learning At Scale

Christopher Llop | christopher.llop@ischool.berkeley.edu <br>
Week 5 | Submission Date: 10/6/2015


In [1]:
# Turn on autoreload for easier troubleshooting.
# This extension makes IPython reload modules before executing code, which
#      is useful because we will be updating the MRJob code while troubleshooting.
%load_ext autoreload
%autoreload 2

<b>HW 5.0</b>

* What is a data warehouse? What is a Star schema? When is it used?

<span style="color:green"><b>Answer:</b></span>
    
    

<b>HW 5.1</b>

* In the database world, what is 3NF? Does machine learning use data in 3NF? If so, why?
* In what form does ML consume data?
* Why would one use log files that are denormalized?

<span style="color:green"><b>Answer:</b></span>
    
    

<b>HW 5.2</b>

Using MRJob, implement a hashside join (memory-backed map-side) for left, 
right and inner joins. Run your code on the data used in HW 4.4: (Recall HW 4.4: Find the most frequent visitor of each page using mrjob and the output of 4.2 (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.)

Justify which table you chose as the Left table in this hashside join.

Please report the number of rows resulting from:

* (1) Left joining Table Left with Table Right
* (2) Right joining Table Left with Table Right
* (3) Inner joining Table Left with Table Right

<span style="color:green"><b>Answer:</b></span>
    
For this problem, I start with the "parsed" dataset created in problem 4.4. As pre-processing, I split it into two tables: the "A" records (webpage information: page ID and URL) and the "V" records (visit information: page ID and visitor ID). I then join the two tables on page ID as requested. I call the "A" table the "Left Table" and the "V" table the "Right Table". The "A" table is the smaller of the two, so it is the one held in memory while the "V" table is streamed through the mappers.

Note that, as expected, the Left Join yields more records (98,663) than the Inner Join (98,654), indicating that there are 9 elements in the left table (website table A) that are NOT in the right table (visit table V). In plain English: some webpages are never viewed in the dataset.

The Right Join and the Inner Join yield the same number of records (98,654), indicating that we have webpage information for every page visited by a customer.
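
The three join types differ only in which unmatched rows survive. As a minimal sketch in plain Python (not the MRJob implementation used in this notebook), the function below holds the left table in a dictionary and streams the right table past it. The table contents and the name `hash_join` are made up for illustration, and the sketch assumes each key appears only once in the left table.

```python
# Toy tables; the IDs and URLs below are invented for illustration.
left_rows = [("1000", "/regwiz"), ("1001", "/support"), ("1002", "/athome")]
right_rows = [("1000", "10001"), ("1001", "10001"),
              ("1001", "10002"), ("9999", "10003")]


def hash_join(left_rows, right_rows, how="inner"):
    """Join (key, value) rows on key, holding the left table in memory."""
    lookup = dict(left_rows)   # hashed left table (assumes unique keys)
    matched = set()            # left keys that found at least one partner
    joined = []
    for key, right_value in right_rows:   # stream the right table
        if key in lookup:
            matched.add(key)
            joined.append((key, lookup[key], right_value))
        elif how == "right":
            # Right join keeps right rows that have no partner on the left
            joined.append((key, "", right_value))
    if how == "left":
        # Left join keeps left rows that never found a partner
        for key, left_value in left_rows:
            if key not in matched:
                joined.append((key, left_value, ""))
    return joined


inner_count = len(hash_join(left_rows, right_rows, "inner"))
left_count = len(hash_join(left_rows, right_rows, "left"))
right_count = len(hash_join(left_rows, right_rows, "right"))
```

On this toy data the inner join returns 3 rows, while the left and right joins each return 4, because each side has exactly one key with no partner.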

In [70]:
# Split the parsedfile into left and right tables
with open('./parsed_msweb.data','r') as parsedfile, open('./A.data','w') as adata, \
     open("./V.data",'w') as vdata:
    for line in parsedfile:
        line = line.strip().split(',')
        if line[0] == 'A':
            # Print page_id, website
            adata.write(line[1]+','+"www.microsoft.com"+line[4].strip('"')+'\n')
        elif line[0] == 'V':
            # Print page_id, visitor_id
            vdata.write(line[1]+','+line[3]+'\n')

#### Left Join

In [159]:
%%writefile LeftJoin.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain


class LeftJoin(MRJob):

    # MRJob Steps
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper,
                   mapper_final = self.mapper_final)
               ]
    
    # Load left table from file. We purposefully store the smaller table in memory
    def mapper_init(self):
        self.left_table = {}
        for s in open('./A.data').readlines():
            s = s.split('\n')[0].split(',')
            # Left join keeps everything in the left hand side, but not things in the right
            # that don't match the left. Let's track what hasn't printed from the left
            # hand side in memory so we can print it later.
            self.left_table[s[0]] = [s[1], 0]            
                
    # Map-side join: left join
    def mapper(self, _, line):
        right_line = line.split(',')
        if self.left_table.get(right_line[0], "no entry") != "no entry":
            # Update left table to show that this entry has been found at least once
            self.left_table[right_line[0]] = [self.left_table[right_line[0]][0], 1]
            # Yield joined data | Website ID, Website URL, Customer ID
            yield None, '{},{},{}'.format(right_line[0],self.left_table[right_line[0]][0],right_line[1])

    # When done, we need to print out all the left table lines that have not yet printed
    def mapper_final(self):
        for k, v in self.left_table.iteritems():
            if v[1] != 1:
                # Yield data | Website ID, Website URL, ''
                yield None, '{},{},{}'.format(k, v[0],'')
                
    # No reducer needed for memory map-side join
    # We could use the hadoop shuffle to sort (similar to "order by")
    
if __name__ == '__main__':
    LeftJoin.run()

Overwriting LeftJoin.py


In [175]:
# Left join web and customer data
def HW5_2a():
    from LeftJoin import LeftJoin
#    import csv

    mr_job = LeftJoin(args=['V.data','--strict-protocols','--file','A.data'])
    
    with mr_job.make_runner() as runner, open('LeftJoin.data', 'w') as outfile:
        runner.run()
        for line in runner.stream_output():
            outfile.write(str(mr_job.parse_output_line(line))+'\n')

    print "The number of observations in the results dataset is:"
    !wc -l ./LeftJoin.data
    print

    print "The first 25 results are:"
    !cat ./LeftJoin.data | head -n25

    print
    print "The last 25 results are:"
    !cat ./LeftJoin.data | tail -n25

HW5_2a()

The number of observations in the results dataset is:
   98663 ./LeftJoin.data

The first 25 results are:
(None, '1000,www.microsoft.com/regwiz,10001')
(None, '1001,www.microsoft.com/support,10001')
(None, '1002,www.microsoft.com/athome,10001')
(None, '1001,www.microsoft.com/support,10002')
(None, '1003,www.microsoft.com/kb,10002')
(None, '1001,www.microsoft.com/support,10003')
(None, '1003,www.microsoft.com/kb,10003')
(None, '1004,www.microsoft.com/search,10003')
(None, '1005,www.microsoft.com/norge,10004')
(None, '1006,www.microsoft.com/misc,10005')
(None, '1003,www.microsoft.com/kb,10006')
(None, '1004,www.microsoft.com/search,10006')
(None, '1007,www.microsoft.com/ie_intl,10007')
(None, '1004,www.microsoft.com/search,10008')
(None, '1008,www.microsoft.com/msdownload,10009')
(None, '1009,www.microsoft.com/windows,10009')
(None, '1010,www.microsoft.com/vbasic,10010')
(None, '1000,www.microsoft.com/regwiz,10010')
(None, '1011,www.microsoft.com/officedev,10010')
(None, '1012,www.micr

#### Inner Join

In [178]:
%%writefile InnerJoin.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain

# InnerJoin is the same as LeftJoin, but without the need to track and output
# the elements of the left table that are not in the right table
class InnerJoin(MRJob):

    # MRJob Steps
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper)
               ]
    
    # Load left table from file. We purposefully store the smaller table in memory
    def mapper_init(self):
        self.left_table = {}
        for s in open('./A.data').readlines():
            s = s.split('\n')[0].split(',')
            self.left_table[s[0]] = s[1]          
                
    # Map-side join: inner join
    def mapper(self, _, line):
        right_line = line.split(',')
        if self.left_table.get(right_line[0], "no entry") != "no entry":
            # Yield joined data | Website ID, Website URL, Customer ID
            yield None, '{},{},{}'.format(right_line[0],self.left_table[right_line[0]],right_line[1])
                
    # No reducer needed for memory map-side join
    # We could use the hadoop shuffle to sort (similar to "order by")
    
if __name__ == '__main__':
    InnerJoin.run()

Overwriting InnerJoin.py


In [179]:
# Inner join web and customer data
def HW5_2b():
    from InnerJoin import InnerJoin

    mr_job = InnerJoin(args=['V.data','--strict-protocols','--file','A.data'])
    
    with mr_job.make_runner() as runner, open('InnerJoin.data', 'w') as outfile:
        runner.run()
        for line in runner.stream_output():
            outfile.write(str(mr_job.parse_output_line(line))+'\n')

    # The inner join should yield fewer results than the left join
    print "The number of observations in the results dataset is:"
    !wc -l ./InnerJoin.data
    print

    print "The first 25 results are:"
    !cat ./InnerJoin.data | head -n25

    print
    print "The last 25 results are:"
    !cat ./InnerJoin.data | tail -n25

HW5_2b()

The number of observations in the results dataset is:
   98654 ./InnerJoin.data

The first 25 results are:
(None, '1000,www.microsoft.com/regwiz,10001')
(None, '1001,www.microsoft.com/support,10001')
(None, '1002,www.microsoft.com/athome,10001')
(None, '1001,www.microsoft.com/support,10002')
(None, '1003,www.microsoft.com/kb,10002')
(None, '1001,www.microsoft.com/support,10003')
(None, '1003,www.microsoft.com/kb,10003')
(None, '1004,www.microsoft.com/search,10003')
(None, '1005,www.microsoft.com/norge,10004')
(None, '1006,www.microsoft.com/misc,10005')
(None, '1003,www.microsoft.com/kb,10006')
(None, '1004,www.microsoft.com/search,10006')
(None, '1007,www.microsoft.com/ie_intl,10007')
(None, '1004,www.microsoft.com/search,10008')
(None, '1008,www.microsoft.com/msdownload,10009')
(None, '1009,www.microsoft.com/windows,10009')
(None, '1010,www.microsoft.com/vbasic,10010')
(None, '1000,www.microsoft.com/regwiz,10010')
(None, '1011,www.microsoft.com/officedev,10010')
(None, '1012,www.mic

#### Right Join

In [180]:
%%writefile RightJoin.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain

# For right join, we simply need to output once per element in right table as we
# stream it.
class RightJoin(MRJob):

    # MRJob Steps
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper)
               ]
    
    # Load left table from file. We purposefully store the smaller table in memory
    def mapper_init(self):
        self.left_table = {}
        for s in open('./A.data').readlines():
            s = s.split('\n')[0].split(',')
            self.left_table[s[0]] = s[1]          
                
    # Map-side join: right join
    def mapper(self, _, line):
        right_line = line.split(',')
        # Yield joined data | Website ID, Website URL ('' if missing), Customer ID
        yield None, '{},{},{}'.format(right_line[0],self.left_table.get(right_line[0],''),right_line[1])
                
    # No reducer needed for memory map-side join
    # We could use the hadoop shuffle to sort (similar to "order by")
    
if __name__ == '__main__':
    RightJoin.run()

Overwriting RightJoin.py


In [181]:
# Right join web and customer data
def HW5_2c():
    from RightJoin import RightJoin

    mr_job = RightJoin(args=['V.data','--strict-protocols','--file','A.data'])
    
    with mr_job.make_runner() as runner, open('RightJoin.data', 'w') as outfile:
        runner.run()
        for line in runner.stream_output():
            outfile.write(str(mr_job.parse_output_line(line))+'\n')

    # The right join should yield the same number of results as the inner join
    # when every visited page also appears in the left table
    print "The number of observations in the results dataset is:"
    !wc -l ./RightJoin.data
    print

    print "The first 25 results are:"
    !cat ./RightJoin.data | head -n25

    print
    print "The last 25 results are:"
    !cat ./RightJoin.data | tail -n25

HW5_2c()

The number of observations in the results dataset is:
   98654 ./RightJoin.data

The first 25 results are:
(None, '1000,www.microsoft.com/regwiz,10001')
(None, '1001,www.microsoft.com/support,10001')
(None, '1002,www.microsoft.com/athome,10001')
(None, '1001,www.microsoft.com/support,10002')
(None, '1003,www.microsoft.com/kb,10002')
(None, '1001,www.microsoft.com/support,10003')
(None, '1003,www.microsoft.com/kb,10003')
(None, '1004,www.microsoft.com/search,10003')
(None, '1005,www.microsoft.com/norge,10004')
(None, '1006,www.microsoft.com/misc,10005')
(None, '1003,www.microsoft.com/kb,10006')
(None, '1004,www.microsoft.com/search,10006')
(None, '1007,www.microsoft.com/ie_intl,10007')
(None, '1004,www.microsoft.com/search,10008')
(None, '1008,www.microsoft.com/msdownload,10009')
(None, '1009,www.microsoft.com/windows,10009')
(None, '1010,www.microsoft.com/vbasic,10010')
(None, '1000,www.microsoft.com/regwiz,10010')
(None, '1011,www.microsoft.com/officedev,10010')
(None, '1012,www.mic

<b>HW 5.3</b>

For the remainder of this assignment you will work with a large subset 
of the Google n-grams dataset,

https://aws.amazon.com/datasets/google-books-ngrams/

which we have placed in a bucket on s3:

s3://filtered-5grams/

In particular, this bucket contains (~200) files in the format:

	(ngram) \t (count) \t (pages_count) \t (books_count)

Do some EDA on this dataset using mrjob, e.g., 

- Longest 5-gram (number of characters)
- Top 10 most frequent words (count), i.e., unigrams
- Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the head -n 1000)
- Distribution of 5-gram sizes (counts) sorted in decreasing order of relative frequency. (Hint: save to PART-000* and take the head -n 1000)
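
As a starting point for the density question, the following plain-Python sketch parses records in the four-column format above and ranks words by count/pages_count. The sample lines are invented, the function names are hypothetical, and attributing a 5-gram's count and pages_count to each of its words is an assumption about how the ratio should be aggregated, not something the assignment specifies.

```python
from collections import defaultdict

# Invented records in the format: (ngram) \t (count) \t (pages_count) \t (books_count)
sample_lines = [
    "the quick brown fox jumps\t10\t5\t2\n",
    "the lazy dog sleeps soundly\t6\t6\t3\n",
]


def parse_record(line):
    """Split one tab-separated record into typed fields."""
    ngram, count, pages, books = line.rstrip("\n").split("\t")
    return ngram, int(count), int(pages), int(books)


def word_density(lines):
    """Return (word, count/pages_count) pairs, densest first."""
    counts = defaultdict(int)
    pages = defaultdict(int)
    for line in lines:
        ngram, count, page_count, _ = parse_record(line)
        for word in ngram.lower().split():
            # Assumption: every word inherits the totals of its 5-gram
            counts[word] += count
            pages[word] += page_count
    density = dict((w, counts[w] / float(pages[w])) for w in counts)
    # Sort by decreasing density, breaking ties alphabetically
    return sorted(density.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = word_density(sample_lines)
```

Taking the head of `ranked` gives the most densely appearing words and the tail gives the least dense ones.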

OPTIONAL Question:
- Plot the log-log plot of the frequency distribution of unigrams. Does it follow power law distribution?

For more background see:
https://en.wikipedia.org/wiki/Log%E2%80%93log_plot
https://en.wikipedia.org/wiki/Power_law
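
One rough way to check the optional question without plotting is to fit a straight line to log(frequency) against log(rank): a power law appears as a straight line on a log-log plot, and the fitted slope estimates the exponent. The sketch below is plain Python with a hypothetical helper name and synthetic frequencies. A good linear fit is consistent with a power law but does not prove one.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data that follows frequency = 1000 / rank exactly,
# so the fitted slope should be very close to -1.
toy_frequencies = [1000.0 / rank for rank in range(1, 101)]
slope = loglog_slope(toy_frequencies)
```

For the real data, `frequencies` would be the unigram counts produced by the word-count job.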

#### Longest 5-gram

In [21]:
%%writefile longest5Gram.py
#!/Library/Frameworks/Python.framework/Versions/2.7/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations
from mrjob.protocol import RawValueProtocol

class longest5Gram(MRJob):
    
    OUTPUT_PROTOCOL = RawValueProtocol
    
    def jobconf(self):
        orig_jobconf = super(longest5Gram, self).jobconf()        
        custom_jobconf = {
            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapred.text.key.comparator.options': '-k1rn',
        }
        combined_jobconf = orig_jobconf
        combined_jobconf.update(custom_jobconf)
        self.jobconf = combined_jobconf
        return combined_jobconf

    def steps(self):
        return [MRStep(mapper = self.mapper, 
                       reducer_init = self.reducer_init,
                       reducer = self.reducer)]

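    def mapper(self, _, line):
        # Input format: (ngram) \t (count) \t (pages_count) \t (books_count)
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        # Key on the character length so the reverse numeric sort puts the longest first
        yield len(ngram),ngram

    # Use reducer_init to set up a counter so that each reducer only emits the
    # first five keys it receives (the longest lengths, given the sort above)
    def reducer_init(self):
        self.first = 0
        
    def reducer(self,count,values):
        if self.first < 5:
            self.first += 1
            # Collect every n-gram that shares this length
            data = {}
            for ngram in values:
                data[ngram] = count
            yield None,data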

        

if __name__ == '__main__':
    longest5Gram.run()

Overwriting longest5Gram.py


In [22]:
!chmod +x longest5Gram.py

In [24]:
!aws s3 mb s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/

make_bucket failed: s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/ A client error (BucketAlreadyOwnedByYou) occurred when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.



In [25]:
!./longest5Gram.py s3://filtered-5grams/ -r emr \
    --output-dir=s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram \
    --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
using existing scratch bucket mrjob-03e94e1f06830625
using s3://mrjob-03e94e1f06830625/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/longest5Gram.cjllop.20151007.035139.869156
writing master bootstrap script to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/longest5Gram.cjllop.20151007.035139.869156/b.py

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Copying non-input files into s3://mrjob-03e94e1f06830625/tmp/longest5Gram.cjllop.20151007.035139.869156/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-JO6XBAWVTCEI
Created new job flow j-JO6XBAWVTCEI
Job launched 30.5s 

In [26]:
!aws s3 cp s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/part-00000 53longest5Gram.txt

download: s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/part-00000 to ./53longest5Gram.txt


In [31]:
# Print result to screen
!cat ./53longest5Gram.txt | head -1

{'ROPLEZIMPREDASTRODONBRASLPKLSON YHROACLMPARCHEYXMMIOUDAVESAURUS PIOFPILOCOWERSURUASOGETSESNEGCP TYRAVOPSIFENGOQUAPIALLOBOSKENUO OWINFUYAIOKENECKSASXHYILPOYNUAT': 159, 'AIOPJUMRXUYVASLYHYPSIBEMAPODIKR UFRYDIUUOLBIGASUAURUSREXLISNAYE RNOONDQSRUNSUBUNOUGRABBERYAIRTC UTAHRAPTOREDILEIPMILBDUMMYUVERI SYEVRAHVELOCYALLOSAURUSLINROTSR': 159}	
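
The job above relies on Hadoop's key comparator to bring the longest lengths to the front. An alternative that avoids sorting every record is to track only the running maximum while scanning, which is roughly what a job built around `mapper_final` would do inside each mapper. The sketch below is plain Python on invented lines, and `longest_ngrams` is a hypothetical helper rather than part of the submitted job.

```python
def longest_ngrams(lines):
    """Return (max_length, ngrams_of_that_length) for tab-separated records."""
    best_length = -1
    best = []
    for line in lines:
        ngram = line.rstrip("\n").split("\t")[0]
        length = len(ngram)
        if length > best_length:
            # New maximum: discard everything collected so far
            best_length, best = length, [ngram]
        elif length == best_length:
            # Tie: keep every n-gram that shares the maximum length
            best.append(ngram)
    return best_length, best


# Invented records in the same four-column format as the dataset
toy_lines = [
    "a b c d e\t1\t1\t1\n",
    "aa bb cc dd ee\t2\t2\t1\n",
    "zz yy xx ww vv\t3\t1\t1\n",
]
max_length, winners = longest_ngrams(toy_lines)
```

In a distributed run, each mapper would emit only its local winners and a single reducer would apply the same comparison once more.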


#### Most common unigrams

In [27]:
!aws s3 sync s3://filtered-5grams ./filtered-5grams

download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-1-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-1-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-10-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-10-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-103-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-103-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-102-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-102-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-101-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-101-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-106-filtered.txt to 

In [22]:
%%writefile commonUnigram.py
#!/Library/Frameworks/Python.framework/Versions/2.7/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from operator import itemgetter
#from itertools import combinations
from mrjob.protocol import RawValueProtocol

class commonUnigram(MRJob):
    
    OUTPUT_PROTOCOL = RawValueProtocol
    
    
# Code below works on AWS only
#    def jobconf(self):
#        orig_jobconf = super(commonUnigram, self).jobconf()        
#        custom_jobconf = {
#            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
#            'mapred.text.key.comparator.options': '-k1rn',
#        }
#        combined_jobconf = orig_jobconf
#        combined_jobconf.update(custom_jobconf)
#        self.jobconf = combined_jobconf
#        return combined_jobconf

    def steps(self):
        return [MRStep(mapper = self.mapper_count, 
                       combiner = self.combiner_count,
                       reducer = self.reducer_count
                       #reducer = self.reducer
                      ),
                MRStep(reducer = self.reducer_rank_uni)
               ]

    def mapper_count(self, _, line):
        line = line.strip()
        #(ngram) \t (count) \t (pages_count) \t (books_count)
        [ngram,count,pages,books] = re.split("\t",line)
        # Each word in the 5-gram inherits the 5-gram's count
        for word in ngram.lower().split():
            yield word, int(count)

    def combiner_count(self, word, counts):
        yield word, sum(counts)
        
    def reducer_count(self, word, counts):
        # Emit under a single key so that one reducer can rank every word
        yield None, (sum(counts), word)        
    
    # This in-memory sort is needed if we are not running on AWS
    def reducer_rank_uni(self, _, count_word_pairs):
        # Find top 10,000
        top_words = sorted(count_word_pairs, key=lambda k: -k[0])[:10000]
        # Print each of the top 10,000 nicely
        for result in top_words:
            yield None, "{}\t{}".format(result[1], result[0])

        

if __name__ == '__main__':
    commonUnigram.run()

Overwriting commonUnigram.py


In [23]:
!chmod +x commonUnigram.py

In [29]:
!./commonUnigram.py ./filtered-5grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt \
    --output-dir=34_unigram_count \
    --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380/step-0-mapper-sorted
> sort /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380/step-0-mapper_part-00000
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380/step-

In [30]:
!./commonUnigram.py ./filtered-5grams \
    --output-dir=34_unigram_count \
    --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869/step-0-mapper_part-00000
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869/step-0-mapper_part-00001
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869/step-0-mapper_part-00002
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869/step-0-mapper_part-00003
writing to /var/

In [18]:
!./commonUnigram.py gbooks_filtered_sample.txt \
    --output-dir=1_JSON_out \
    --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032/step-0-mapper-sorted
> sort /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032/step-0-mapper_part-00000
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032/step-

In [16]:
# Print the most frequent unigrams in the sample file
def HW5_3():
    from commonUnigram import commonUnigram

    mr_job = commonUnigram(args=['gbooks_filtered_sample.txt'])
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            print mr_job.parse_output_line(line)
            
HW5_3()
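
The ranking step in `commonUnigram` sorts every (count, word) pair inside one reducer. When only the top few words are needed, `heapq.nlargest` from the standard library returns the same head of the list without a full sort. The sketch below uses invented counts, and `top_n` is a hypothetical helper.

```python
import heapq


def top_n(count_word_pairs, n=10):
    """Return the n pairs with the largest counts, largest first."""
    return heapq.nlargest(n, count_word_pairs, key=lambda pair: pair[0])


# Invented (count, word) pairs in the shape emitted by reducer_count
toy_pairs = [(12, "of"), (40, "the"), (7, "cat"), (25, "and"), (3, "zebra")]
top_three = top_n(toy_pairs, n=3)
```

Here `top_three` holds the pairs for "the", "and" and "of", in that order.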



<b>HW 5.4</b>

In this part of the assignment we will focus on developing methods
for detecting synonyms, using the Google 5-grams dataset. To accomplish
this you must script two main tasks using MRJob:

(1) Build stripes of word co-occurrence for the top 10,000
most frequently appearing words across the entire set of 5-grams,
and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).

(2) Using two (symmetric) comparison methods of your choice 
(e.g., correlations, distances, similarities), pairwise compare 
all stripes (vectors), and output to a file in your bucket on s3.

==Design notes for (1)==
For this task you will be able to modify the pattern we used in HW 3.2
(feel free to use the solution as reference). To total the word counts 
across the 5-grams, output the support from the mappers using the total 
order inversion pattern:

<*word,count>

to ensure that the support arrives before the cooccurrences.

In addition to ensuring the determination of the total word counts,
the mapper must also output co-occurrence counts for the pairs of
words inside of each 5-gram. Treat these words as a basket,
as we have in HW 3, but count all stripes or pairs in both orders,
i.e., count both orderings: (word1,word2), and (word2,word1), to preserve
symmetry in our output for (2).
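
The design notes for (1) can be sketched in plain Python as follows. This is an illustration on invented data, not an MRJob job: `map_ngram` and `reduce_stripes` are hypothetical names, the vocabulary stands in for the top 10,000 words, and emitting the support once per word occurrence is an assumption. Sorting the emitted keys plays the role of the shuffle, where the `*` prefix makes the support keys arrive before the stripes.

```python
from collections import defaultdict


def map_ngram(ngram, count, vocabulary):
    """Emit support (*word, count) and one stripe per word in the basket."""
    words = [w for w in ngram.lower().split() if w in vocabulary]
    for word in words:
        yield "*" + word, count          # order inversion key for totals
    basket = sorted(set(words))          # treat the 5-gram as a basket
    for word in basket:
        # Both orderings are covered because every word gets its own stripe
        stripe = dict((other, count) for other in basket if other != word)
        if stripe:
            yield word, stripe


def reduce_stripes(emitted):
    """Sum supports and merge stripes; '*' keys sort ahead of plain words."""
    support = defaultdict(int)
    stripes = defaultdict(lambda: defaultdict(int))
    for key, value in sorted(emitted, key=lambda kv: kv[0]):
        if key.startswith("*"):
            support[key[1:]] += value
        else:
            for other, c in value.items():
                stripes[key][other] += c
    return dict(support), dict((w, dict(s)) for w, s in stripes.items())


# Invented 5-grams with counts, and a tiny stand-in vocabulary
toy_ngrams = [("the cat sat on mat", 2), ("the dog sat on mat", 3)]
vocabulary = set(["the", "cat", "dog", "sat"])
emitted = [pair for ngram, count in toy_ngrams
           for pair in map_ngram(ngram, count, vocabulary)]
support, stripes = reduce_stripes(emitted)
```

The resulting stripes are symmetric: the count stored for ("the", "sat") equals the count stored for ("sat", "the").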

==Design notes for (2)==
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

- Spearman correlation
- Euclidean distance
- Taxicab (Manhattan) distance
- Shortest path graph distance (a graph, because our data is symmetric!)
- Pearson correlation
- Cosine similarity
- Kendall correlation
...

However, be cautioned that some comparison methods are more difficult to
parallelize than others, and do not perform more associations than is necessary, 
since your choice of association will be symmetric.
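
Because the measures are symmetric, each unordered pair of words needs to be compared only once. The sketch below shows this with `itertools.combinations` and two of the listed measures, cosine similarity and Euclidean distance, computed on sparse stripes stored as dictionaries. The stripes are invented and the function names are hypothetical. It runs in memory, so it illustrates the arithmetic rather than a parallel design.

```python
import math
from itertools import combinations


def cosine(s1, s2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(s1[k] * s2[k] for k in set(s1) & set(s2))
    norm1 = math.sqrt(sum(v * v for v in s1.values()))
    norm2 = math.sqrt(sum(v * v for v in s2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def euclidean(s1, s2):
    """Euclidean distance, treating missing entries as zero."""
    keys = set(s1) | set(s2)
    return math.sqrt(sum((s1.get(k, 0) - s2.get(k, 0)) ** 2 for k in keys))


def pairwise(stripes, measure):
    """Yield ((word1, word2), score) once per unordered pair."""
    for w1, w2 in combinations(sorted(stripes), 2):
        yield (w1, w2), measure(stripes[w1], stripes[w2])


# Invented stripes: word -> {co-occurring word: count}
toy_stripes = {
    "cat": {"sat": 2, "the": 2},
    "dog": {"sat": 3, "the": 3},
    "sat": {"cat": 2, "dog": 3},
}
cosine_scores = dict(pairwise(toy_stripes, cosine))
```

With three words there are only three comparisons, and "cat" and "dog" come out as maximally similar under cosine because their stripes point in the same direction.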


In [6]:
with open("stripes.txt") as infile:
    for line in infile:
        key = line.split("\t")[0]
        value = line.split("\t")[1].strip('"').split(",")
        print len(value)
        break

10000


<b>HW 5.5</b>

In this part of the assignment you will evaluate the success of your synonym detector.
Take the top 1,000 closest/most similar/correlative pairs of words as determined
by your measure in (2), and use the synonyms function in the accompanying
python code:

nltk_synonyms.py

Note: This will require installing the python nltk package:

http://www.nltk.org/install.html

and downloading its data with nltk.download().

For each (word1,word2) pair, check to see if word1 is in the list, 
synonyms(word2), and vice-versa. If one of the two is a synonym of the other, 
then consider this pair a 'hit', and then report the precision, recall, and F1 measure of 
your detector across your 1,000 best guesses. Report the macro averages of these measures.
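
The hit test and the three scores can be sketched as below. Several things here are assumptions: the `synonyms` function is a stand-in backed by a toy dictionary (the real one lives in nltk_synonyms.py and is not reproduced here), recall is taken relative to a supplied number of true synonym pairs, and the scores are pooled over the whole list rather than macro-averaged.

```python
# Toy lookup standing in for the synonyms function in nltk_synonyms.py
toy_synonyms = {
    "big": set(["large"]),
    "large": set(["big"]),
    "fast": set(["quick"]),
    "quick": set(["fast"]),
}


def synonyms(word):
    """Hypothetical stand-in: return the known synonyms of a word."""
    return toy_synonyms.get(word, set())


def is_hit(word1, word2):
    """A pair is a hit if either word lists the other as a synonym."""
    return word1 in synonyms(word2) or word2 in synonyms(word1)


def score(predicted_pairs, true_pair_count):
    """Return (precision, recall, F1) for a list of guessed pairs."""
    hits = sum(1 for w1, w2 in predicted_pairs if is_hit(w1, w2))
    precision = hits / float(len(predicted_pairs)) if predicted_pairs else 0.0
    recall = hits / float(true_pair_count) if true_pair_count else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two guesses, one of which is a real synonym pair; the toy lookup
# contains two true pairs in total: (big, large) and (fast, quick).
precision, recall, f1 = score([("big", "large"), ("big", "fast")], 2)
```

On this toy input one of two guesses is a hit and one of two true pairs is found, so precision, recall and F1 all equal 0.5.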

<b>HW 5.6 (optional)</b>

There are many good ways to build our synonym detectors, so for optional homework, 
measure co-occurrence by (left/right/all) consecutive words only, 
or make stripes according to word co-occurrences with the accompanying 
2-, 3-, or 4-grams (note here that your output will no longer 
be interpretable as a network) inside of the 5-grams.
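
For the consecutive-words variant, a minimal sketch is to pair each word with its immediate neighbour instead of with every other word in the 5-gram. The reading of "left" and "right" below (pairing a word with the word before it or after it) is an assumption, and `consecutive_pairs` is a hypothetical helper.

```python
def consecutive_pairs(ngram, direction="all"):
    """Yield (word, neighbour) pairs for adjacent words only."""
    words = ngram.lower().split()
    for previous, following in zip(words, words[1:]):
        if direction in ("right", "all"):
            yield previous, following    # word paired with its right neighbour
        if direction in ("left", "all"):
            yield following, previous    # word paired with its left neighbour


right_pairs = list(consecutive_pairs("the cat sat", "right"))
left_pairs = list(consecutive_pairs("the cat sat", "left"))
all_pairs = list(consecutive_pairs("the cat sat", "all"))
```

Only the "all" setting produces symmetric pairs, so it is the only one of the three whose output can still be compared with a symmetric measure without further adjustment.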

<b>HW 5.7 (optional)</b>

Once again, benchmark your top 1,000 associations (as in 5.5), this time for your
results from 5.6. Has your detector improved?

##### This concludes HW 5. Thanks for reading!