# MIDS W261 Machine Learning At Scale

Christopher Llop | christopher.llop@ischool.berkeley.edu <br>
Week 5 | Submission Date: 10/6/2015


In [1]:
# Turn on autoreload for easier troubleshooting.
# This function causes iPython to re-load modules before executing code, which
#      is useful because we will be updating the MRJob code while troubleshooting.
%load_ext autoreload
%autoreload 2

<b>HW 5.0</b>

* What is a data warehouse? What is a Star schema? When is it used?

<span style="color:green"><b>Answer:</b></span>
    
    

<b>HW 5.1</b>

* In the database world What is 3NF? Does machine learning use data in 3NF? If so why? 
* In what form does ML consume data?
* Why would one use log files that are denormalized?

<span style="color:green"><b>Answer:</b></span>
    
    

<b>HW 5.2</b>

Using MRJob, implement a hashside join (memory-backed map-side) for left, 
right and inner joins. Run your code on the  data used in HW 4.4: (Recall HW 4.4: Find the most frequent visitor of each page using mrjob and the output of 4.2  (i.e., transfromed log file). In this output please include the webpage URL, webpageID and Visitor ID.)

Justify which table you chose as the Left table in this hashside join.

Please report the number of rows resulting from:

* (1) Left joining Table Left with Table Right
* (2) Right joining Table Left with Table Right
* (3) Inner joining Table Left with Table Right

<span style="color:green"><b>Answer:</b></span>
    
For this problem, I will start with the "parsed" dataset created in problem 4.4. First, I will split this into a left table and right table as pre-processing to the assignment. Then, we will merge together the tables as requested. We want to merge data from the "A" (webpage information) and "V" (customer information) tables. I will call the "A" table "Left Table" and the "V" table "Right Table".

Note that, as expected, the Left Join yields more records than the Inner Join, indicating that there are elements in the left table (website table A) that are NOT in the right table (customer table V). In plain English: Some webpages are never viewed in the dataset.

The Right Join and the Inner Join yield the same number of records, indicating that we have webpage information for each page visited by a customer.

In [70]:
# Split the parsedfile into left and right tables
with open('./parsed_msweb.data','r') as parsedfile, open('./A.data','w') as adata, \
     open("./V.data",'w') as vdata:
    for line in parsedfile:
        line = line.strip().split(',')
        if line[0] == 'A':
            # Print page_id, website
            adata.write(line[1]+','+"www.microsoft.com"+line[4].strip('"')+'\n')
        elif line[0] == 'V':
            # Print page_id, visitor_id
            vdata.write(line[1]+','+line[3]+'\n')

#### Left Join

In [159]:
%%writefile LeftJoin.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain


class LeftJoin(MRJob):

    # MRJob Steps
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper,
                   mapper_final = self.mapper_final)
               ]
    
    # Load left table from file. We purposefully store the smaller table in memory
    def mapper_init(self):
        self.left_table = {}
        for s in open('./A.data').readlines():
            s = s.split('\n')[0].split(',')
            # Left join keeps everything in the left hand side, but not things in the right
            # that don't match the left. Let's track what hasn't printed from the left
            # hand side in memory so we can print it later.
            self.left_table[s[0]] = [s[1], 0]            
                
    # Map-side join: left join
    def mapper(self, _, line):
        right_line = line.split(',')
        if self.left_table.get(right_line[0], "no entry") != "no entry":
            # Update left table to show that this entry has been found at least once
            self.left_table[right_line[0]] = [self.left_table[right_line[0]][0], 1]
            # Yield joined data | Website ID, Website URL, Customer ID
            yield None, '{},{},{}'.format(right_line[0],self.left_table[right_line[0]][0],right_line[1])

    # When done, we need to print out all the left table lines that have not yet printed
    def mapper_final(self):
        for k, v in self.left_table.iteritems():
            if v[1] != 1:
                # Yield data | Website ID, Website URL, ''
                yield None, '{},{},{}'.format(k, v[0],'')
                
    # No reducer needed for memory map-side join
    # We could use the hadoop shuffle to sort (similar to "order by")
    
if __name__ == '__main__':
    LeftJoin.run()

Overwriting LeftJoin.py


In [175]:
# Left join web and customer data
def HW5_3a():
    from LeftJoin import LeftJoin
#    import csv

    mr_job = LeftJoin(args=['V.data','--strict-protocols','--file','A.data'])
    
    with mr_job.make_runner() as runner, open('LeftJoin.data', 'w') as outfile:
        runner.run()
        for line in runner.stream_output():
            outfile.write(str(mr_job.parse_output_line(line))+'\n')

    print "There number of observations in the results dataset is:"
    !wc -l ./LeftJoin.data
    print

    print "The first 25 results are:"
    !cat ./LeftJoin.data | head -n25

    print
    print "The last 25 results are:"
    !cat ./LeftJoin.data | tail -n25

HW5_3a()

There number of observations in the results dataset is:
   98663 ./LeftJoin.data

The first 25 results are:
(None, '1000,www.microsoft.com/regwiz,10001')
(None, '1001,www.microsoft.com/support,10001')
(None, '1002,www.microsoft.com/athome,10001')
(None, '1001,www.microsoft.com/support,10002')
(None, '1003,www.microsoft.com/kb,10002')
(None, '1001,www.microsoft.com/support,10003')
(None, '1003,www.microsoft.com/kb,10003')
(None, '1004,www.microsoft.com/search,10003')
(None, '1005,www.microsoft.com/norge,10004')
(None, '1006,www.microsoft.com/misc,10005')
(None, '1003,www.microsoft.com/kb,10006')
(None, '1004,www.microsoft.com/search,10006')
(None, '1007,www.microsoft.com/ie_intl,10007')
(None, '1004,www.microsoft.com/search,10008')
(None, '1008,www.microsoft.com/msdownload,10009')
(None, '1009,www.microsoft.com/windows,10009')
(None, '1010,www.microsoft.com/vbasic,10010')
(None, '1000,www.microsoft.com/regwiz,10010')
(None, '1011,www.microsoft.com/officedev,10010')
(None, '1012,www.micr

#### Inner Join

In [178]:
%%writefile InnerJoin.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain

# InnerJoin is same as LeftJoin, but without the need to track and the output
# elements of the left table that are not in the right table
class InnerJoin(MRJob):

    # MRJob Steps
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper)
               ]
    
    # Load left table from file. We purposefully store the smaller table in memory
    def mapper_init(self):
        self.left_table = {}
        for s in open('./A.data').readlines():
            s = s.split('\n')[0].split(',')
            self.left_table[s[0]] = s[1]          
                
    # Map-side join: inner join
    def mapper(self, _, line):
        right_line = line.split(',')
        if self.left_table.get(right_line[0], "no entry") != "no entry":
            # Yield joined data | Website ID, Website URL, Customer ID
            yield None, '{},{},{}'.format(right_line[0],self.left_table[right_line[0]],right_line[1])
                
    # No reducer needed for memory map-side join
    # We could use the hadoop shuffle to sort (similar to "order by")
    
if __name__ == '__main__':
    InnerJoin.run()

Overwriting InnerJoin.py


In [179]:
# Inner join web and customer data
def HW5_3b():
    from InnerJoin import InnerJoin

    mr_job = InnerJoin(args=['V.data','--strict-protocols','--file','A.data'])
    
    with mr_job.make_runner() as runner, open('InnerJoin.data', 'w') as outfile:
        runner.run()
        for line in runner.stream_output():
            outfile.write(str(mr_job.parse_output_line(line))+'\n')

    # The inner join should yiled fewer results than the left join
    print "There number of observations in the results dataset is:"
    !wc -l ./InnerJoin.data
    print

    print "The first 25 results are:"
    !cat ./InnerJoin.data | head -n25

    print
    print "The last 25 results are:"
    !cat ./InnerJoin.data | tail -n25

HW5_3b()

There number of observations in the results dataset is:
   98654 ./InnerJoin.data

The first 25 results are:
(None, '1000,www.microsoft.com/regwiz,10001')
(None, '1001,www.microsoft.com/support,10001')
(None, '1002,www.microsoft.com/athome,10001')
(None, '1001,www.microsoft.com/support,10002')
(None, '1003,www.microsoft.com/kb,10002')
(None, '1001,www.microsoft.com/support,10003')
(None, '1003,www.microsoft.com/kb,10003')
(None, '1004,www.microsoft.com/search,10003')
(None, '1005,www.microsoft.com/norge,10004')
(None, '1006,www.microsoft.com/misc,10005')
(None, '1003,www.microsoft.com/kb,10006')
(None, '1004,www.microsoft.com/search,10006')
(None, '1007,www.microsoft.com/ie_intl,10007')
(None, '1004,www.microsoft.com/search,10008')
(None, '1008,www.microsoft.com/msdownload,10009')
(None, '1009,www.microsoft.com/windows,10009')
(None, '1010,www.microsoft.com/vbasic,10010')
(None, '1000,www.microsoft.com/regwiz,10010')
(None, '1011,www.microsoft.com/officedev,10010')
(None, '1012,www.mic

 #### Right Join

In [180]:
%%writefile RightJoin.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain

# For right join, we simply need to output once per element in right table as we
# stream it.
class RightJoin(MRJob):

    # MRJob Steps
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper)
               ]
    
    # Load left table from file. We purposefully store the smaller table in memory
    def mapper_init(self):
        self.left_table = {}
        for s in open('./A.data').readlines():
            s = s.split('\n')[0].split(',')
            self.left_table[s[0]] = s[1]          
                
    # Map-side join: right join
    def mapper(self, _, line):
        right_line = line.split(',')
        # Yield joined data | Website ID, Website URL ('' if missing), Customer ID
        yield None, '{},{},{}'.format(right_line[0],self.left_table.get(right_line[0],''),right_line[1])
                
    # No reducer needed for memory map-side join
    # We could use the hadoop shuffle to sort (similar to "order by")
    
if __name__ == '__main__':
    RightJoin.run()

Overwriting RightJoin.py


In [181]:
# Right join web and customer data
def HW5_3c():
    from RightJoin import RightJoin

    mr_job = RightJoin(args=['V.data','--strict-protocols','--file','A.data'])
    
    with mr_job.make_runner() as runner, open('RightJoin.data', 'w') as outfile:
        runner.run()
        for line in runner.stream_output():
            outfile.write(str(mr_job.parse_output_line(line))+'\n')

    # The inner join should yiled fewer results than the left join
    print "There number of observations in the results dataset is:"
    !wc -l ./RightJoin.data
    print

    print "The first 25 results are:"
    !cat ./RightJoin.data | head -n25

    print
    print "The last 25 results are:"
    !cat ./RightJoin.data | tail -n25

HW5_3c()

There number of observations in the results dataset is:
   98654 ./RightJoin.data

The first 25 results are:
(None, '1000,www.microsoft.com/regwiz,10001')
(None, '1001,www.microsoft.com/support,10001')
(None, '1002,www.microsoft.com/athome,10001')
(None, '1001,www.microsoft.com/support,10002')
(None, '1003,www.microsoft.com/kb,10002')
(None, '1001,www.microsoft.com/support,10003')
(None, '1003,www.microsoft.com/kb,10003')
(None, '1004,www.microsoft.com/search,10003')
(None, '1005,www.microsoft.com/norge,10004')
(None, '1006,www.microsoft.com/misc,10005')
(None, '1003,www.microsoft.com/kb,10006')
(None, '1004,www.microsoft.com/search,10006')
(None, '1007,www.microsoft.com/ie_intl,10007')
(None, '1004,www.microsoft.com/search,10008')
(None, '1008,www.microsoft.com/msdownload,10009')
(None, '1009,www.microsoft.com/windows,10009')
(None, '1010,www.microsoft.com/vbasic,10010')
(None, '1000,www.microsoft.com/regwiz,10010')
(None, '1011,www.microsoft.com/officedev,10010')
(None, '1012,www.mic

<b>HW 5.3</b>

For the remainder of this assignment you will work with a large subset 
of the Google n-grams dataset,

https://aws.amazon.com/datasets/google-books-ngrams/

which we have placed in a bucket on s3:

s3://filtered-5grams/

In particular, this bucket contains (~200) files in the format:

	(ngram) \t (count) \t (pages_count) \t (books_count)

Do some EDA on this dataset using mrjob, e.g., 

- Longest 5-gram (number of characters)
- Top 10 most frequent words (count), i.e., unigrams
- Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the head -n 1000)
- Distribution of 5-gram sizes (counts) sorted in decreasing order of relative frequency. (Hint: save to PART-000* and take the head -n 1000)

OPTIONAL Question:
- Plot the log-log plot of the frequency distributuion of unigrams. Does it follow power law distribution?

For more background see:
https://en.wikipedia.org/wiki/Log%E2%80%93log_plot
https://en.wikipedia.org/wiki/Power_law

#### Longest 5-gram

In [21]:
%%writefile longest5Gram.py
#!/Library/Frameworks/Python.framework/Versions/2.7/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations
from mrjob.protocol import RawValueProtocol

class longest5Gram(MRJob):
    
    OUTPUT_PROTOCOL = RawValueProtocol
    
    def jobconf(self):
        orig_jobconf = super(longest5Gram, self).jobconf()        
        custom_jobconf = {
            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapred.text.key.comparator.options': '-k1rn',
        }
        combined_jobconf = orig_jobconf
        combined_jobconf.update(custom_jobconf)
        self.jobconf = combined_jobconf
        return combined_jobconf

    def steps(self):
        return [MRStep(mapper = self.mapper, 
                       reducer_init = self.reducer_init,
                       reducer = self.reducer)]

    def mapper(self, _, line):
        line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        yield len(ngram),ngram

    # Use reducer_init to set up a variable to only output from the reducer once
    # (top result)
    def reducer_init(self):
        self.first = 0
        
    def reducer(self,count,values):
        data = {}
        if self.first < 5:
            self.first += 1
            for ngram in values:
                data[ngram] = count
            yield None,data
#         for ngram in values:
#             yield ngram,count

        

if __name__ == '__main__':
    longest5Gram.run()

Overwriting longest5Gram.py


In [22]:
!chmod +x longest5Gram.py

In [24]:
!aws s3 mb s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/

make_bucket failed: s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/ A client error (BucketAlreadyOwnedByYou) occurred when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.



In [25]:
!./longest5Gram.py s3://filtered-5grams/ -r emr \
    --output-dir=s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram \
    --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
using existing scratch bucket mrjob-03e94e1f06830625
using s3://mrjob-03e94e1f06830625/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/longest5Gram.cjllop.20151007.035139.869156
writing master bootstrap script to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/longest5Gram.cjllop.20151007.035139.869156/b.py

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Copying non-input files into s3://mrjob-03e94e1f06830625/tmp/longest5Gram.cjllop.20151007.035139.869156/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-JO6XBAWVTCEI
Created new job flow j-JO6XBAWVTCEI
Job launched 30.5s 

In [26]:
!aws s3 cp s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/part-00000 53longest5Gram.txt

download: s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/part-00000 to ./53longest5Gram.txt


In [31]:
# Print result to screen
!cat ./ 53longest5Gram.txt | head -1

cat: ./: Is a directory
{'ROPLEZIMPREDASTRODONBRASLPKLSON YHROACLMPARCHEYXMMIOUDAVESAURUS PIOFPILOCOWERSURUASOGETSESNEGCP TYRAVOPSIFENGOQUAPIALLOBOSKENUO OWINFUYAIOKENECKSASXHYILPOYNUAT': 159, 'AIOPJUMRXUYVASLYHYPSIBEMAPODIKR UFRYDIUUOLBIGASUAURUSREXLISNAYE RNOONDQSRUNSUBUNOUGRABBERYAIRTC UTAHRAPTOREDILEIPMILBDUMMYUVERI SYEVRAHVELOCYALLOSAURUSLINROTSR': 159}	


#### Most common unigrams

In [27]:
!aws s3 sync s3://filtered-5grams ./filtered-5grams

download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-1-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-1-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-10-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-10-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-103-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-103-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-102-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-102-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-101-filtered.txt to filtered-5grams/googlebooks-eng-all-5gram-20090715-101-filtered.txt
download: s3://filtered-5grams/googlebooks-eng-all-5gram-20090715-106-filtered.txt to 

In [22]:
%%writefile commonUnigram.py
#!/Library/Frameworks/Python.framework/Versions/2.7/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from operator import itemgetter
#from itertools import combinations
from mrjob.protocol import RawValueProtocol

class commonUnigram(MRJob):
    
    OUTPUT_PROTOCOL = RawValueProtocol
    
    
# Code below works on AWS only
#    def jobconf(self):
#        orig_jobconf = super(commonUnigram, self).jobconf()        
#        custom_jobconf = {
#            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
#            'mapred.text.key.comparator.options': '-k1rn',
#        }
#        combined_jobconf = orig_jobconf
#        combined_jobconf.update(custom_jobconf)
#        self.jobconf = combined_jobconf
#        return combined_jobconf

    def steps(self):
        return [MRStep(mapper = self.mapper_count, 
                       combiner = self.combiner_count,
                       reducer = self.reducer_count
                       #reducer = self.reducer
                      ),
                MRStep(reducer = self.reducer_rank_uni)
               ]

    def mapper_count(self, _, line):
        line.strip()
        #(ngram) \t (count) \t (pages_count) \t (books_count)
        [ngram,count,pages,books] = re.split("\t",line)
        for word in ngram.lower().split():
            yield word, int(count)

    def combiner_count(self, ngram, counts):
        yield ngram, sum(counts)
        
    def reducer_count(self, ngram, counts):
#        yield ngram, sum(counts)
        yield None, (sum(counts), ngram)        
    
    # This in-memory sort is needed if we are not running on AWS
    def reducer_rank_uni(self, _, visit_counts):
        # Find top 10,000
        top_five = sorted(visit_counts, key=lambda k: -k[0])[:10000]
        # Print each of the top 10000 nicely
        for i, result in enumerate(top_five):
            yield None, "{}\t{}".format(result[1], result[0])

        

if __name__ == '__main__':
    commonUnigram.run()

Overwriting commonUnigram.py


In [23]:
!chmod +x commonUnigram.py

In [29]:
!./commonUnigram.py ./filtered-5grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt \
    --output-dir=34_unigram_count \
    --no-output \

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380/step-0-mapper-sorted
> sort /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380/step-0-mapper_part-00000
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.225434.348380/step-

In [30]:
!./commonUnigram.py ./filtered-5grams \
    --output-dir=34_unigram_count \
    --no-output \

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869/step-0-mapper_part-00000
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869/step-0-mapper_part-00001
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869/step-0-mapper_part-00002
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151011.002031.467869/step-0-mapper_part-00003
writing to /var/

In [18]:
!./commonUnigram.py gbooks_filtered_sample.txt \
    --output-dir=1_JSON_out \
    --no-output \

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032/step-0-mapper-sorted
> sort /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032/step-0-mapper_part-00000
writing to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/commonUnigram.cjllop.20151010.221914.111032/step-

In [16]:
# Find the visits per page
def HW5_3():
    from commonUnigram import commonUnigram
#    import csv

    mr_job = commonUnigram(args=['gbooks_filtered_sample.txt'])
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            print mr_job.parse_output_line(line)
            
HW5_3()



<b>HW 5.4</b>

In this part of the assignment we will focus on developing methods
for detecting synonyms, using the Google 5-grams dataset. To accomplish
this you must script two main tasks using MRJob:

(1) Build stripes of word co-ocurrence for the top 10,000
most frequently appearing words across the entire set of 5-grams,
and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).

(2) Using two (symmetric) comparison methods of your choice 
(e.g., correlations, distances, similarities), pairwise compare 
all stripes (vectors), and output to a file in your bucket on s3.

==Design notes for (1)==
For this task you will be able to modify the pattern we used in HW 3.2
(feel free to use the solution as reference). To total the word counts 
across the 5-grams, output the support from the mappers using the total 
order inversion pattern:

<*word,count>

to ensure that the support arrives before the cooccurrences.

In addition to ensuring the determination of the total word counts,
the mapper must also output co-occurrence counts for the pairs of
words inside of each 5-gram. Treat these words as a basket,
as we have in HW 3, but count all stripes or pairs in both orders,
i.e., count both orderings: (word1,word2), and (word2,word1), to preserve
symmetry in our output for (2).

==Design notes for (2)==
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

- Spearman correlation
- Euclidean distance
- Taxicab (Manhattan) distance
- Shortest path graph distance (a graph, because our data is symmetric!)
- Pearson correlation
- Cosine similarity
- Kendall correlation
...

However, be cautioned that some comparison methods are more difficult to
parallelize than others, and do not perform more associations than is necessary, 
since your choice of association will be symmetric.


In [6]:
with open("stripes.txt") as infile:
    for line in infile:
        key = line.split("\t")[0]
        value = line.split("\t")[1].strip('"').split(",")
        print len(value)
        break

10000


<b>HW 5.5</b>

In this part of the assignment you will evaluate the success of you synonym detector.
Take the top 1,000 closest/most similar/correlative pairs of words as determined
by your measure in (2), and use the synonyms function in the accompanying
python code:

nltk_synonyms.py

Note: This will require installing the python nltk package:

http://www.nltk.org/install.html

and downloading its data with nltk.download().

For each (word1,word2) pair, check to see if word1 is in the list, 
synonyms(word2), and vice-versa. If one of the two is a synonym of the other, 
then consider this pair a 'hit', and then report the precision, recall, and F1 measure  of 
your detector across your 1,000 best guesses. Report the macro averages of these measures.

<b>HW 5.6 (optional)</b>

There are many good ways to build our synonym detectors, so for optional homework, 
measure co-occurrence by (left/right/all) consecutive words only, 
or make stripes according to word co-occurrences with the accompanying 
2-, 3-, or 4-grams (note here that your output will no longer 
be interpretable as a network) inside of the 5-grams.

<b>Hw 5.7 (optional)</b>

Once again, benchmark your top 1,000 associations (as in 5.5), this time for your
results from 5.6. Has your detector improved?

<b>HW4.0.</b>
What is MrJob? How is it different to Hadoop MapReduce? 
What are the mapper_final(), combiner_final(), reducer_final() methods? When are they called?

<span style="color:green"><b>Answer:</b></span>
1. MrJob is a Python library created originally by Yelp and now open-source. It helps facilitate the use of Hadoop Streaming by providing an intuitive programming interface that abstracts away from the major implementation mechanics of running Hadoop jobs so that the programmer can focus on programming their Mappers and Reducers. The goal is to allow data scientists to focus more on solving their problem and less on technical details of Hadoop and running the job. MrJob also handles aspects of managing the distributed file system and makes it very easy to run multiple jobs in succession as "steps" via MrStep.

2. MrJob leverages Hadoop MapReduce but is much more intuitive to program. Instead of writing mapper and reducer program files, we can write mapper and reducer functions. MrJob also makes it very simple to implement additional methods, such as mapper_final(), combiner_final(), etc. MrJob gives us slightly less control than if we were using Hadoop in a more native form, but many data scientists have found the abstraction to be worth it.

3. The _final() methods allow the programmer to run a bit of code when the relevant segment of the MapReduce framework has completed. They are called on the same node where the step took place, and can interact with the variables that were in memory on that node at the end of the step. For example, an in-memory combiner that is part of a mapper can yield final values in the mapper_final() method.

<b>HW4.1. </b>

What is serialization in the context of MrJob or Hadoop? 
When it used in these frameworks? 
What is the default serialization mode for input and outputs for MrJob? 

<span style="color:green"><b>Answer:</b></span>

1. Serialization is the process of storing data into binary form for use in processing or transmission (instead of text string form). The method of serialization is important because binary data takes up much less space and can be processed much faster than string based data. In a case such as hadoop, smaller data means faster runtime and less network burden. Serialization is used in MrJob and Hadoop to transmit data (in/out). Native Hadoop uses its own serialization framework that is especially compact. This is not available to MrJob.

2. The default serialization mode for input and outputs for MrJob is pickled data. The default  is JSON serialization. MrJob does not have the same kind of compressed serialization options that native Hadoop has. One notable serialization is Pickling, which is relatively compact. Many developers think the programming gains of MrJob are well worth the tradeoff.

<b>HW4.2</b>

Recall the Microsoft logfiles data from the async lecture. The logfiles are described are located at:

https://kdd.ics.uci.edu/databases/msweb/msweb.html
http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/

This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in Feburary 1998.

 Here, you must preprocess the data on a single node (i.e., not on a cluster of nodes) from the format:

C,"10001",10001   #Visitor id 10001 <br>
V,1000,1          #Visit by Visitor 10001 to page id 1000 <br>
V,1001,1          #Visit by Visitor 10001 to page id 1001 <br>
V,1002,1          #Visit by Visitor 10001 to page id 1002 <br>
C,"10002",10002   #Visitor id 10001 <br>
V <br>
Note: #denotes comments <br>
to the format: <br>

V,1000,1,C, 10001 <br>
V,1001,1,C, 10001 <br>
V,1002,1,C, 10001 <br>

Write the python code to accomplish this.

<span style="color:green"><b>Answer:</b></span>
The code below sorts through and transforms the data. It is largely similar to code provided in the class materials, but in addition I have standardized the customer and page ID numbers to not be encased in quotes.

In [272]:
# Note: Clean Weblog Data
def HW4_2():
    with open('anonymous-msweb.data', 'r') as infile, open('parsed_msweb.data', 'w') as outfile:
        for line in infile:
            data = line.split(',')
            if data[0] == 'C':
                # Keep case lines in the same format we see them in
                this_customer = data
                outfile.write(this_customer[0]+','+this_customer[2])
            elif data[0] == 'V':
                # Modify vote lines so they can be processed in parallel
                outfile.write(data[0]+','+data[1]+','+this_customer[0]+','+this_customer[2])
            else:
                # All other lines stay as-is (we'll need the 'A' lines later)
                outfile.write(line)
                
    print "The first 50 lines are:"
    !cat ./parsed_msweb.data | head -n50

    print
    print "The last 50 lines are:"
    !cat ./parsed_msweb.data | tail -n50

HW4_2()

The first 50 lines are:
I,4,"www.microsoft.com","created by getlog.pl"
T,1,"VRoot",0,0,"VRoot"
N,0,"0"
N,1,"1"
T,2,"Hide1",0,0,"Hide"
N,0,"0"
N,1,"1"
A,1287,1,"International AutoRoute","/autoroute"
A,1288,1,"library","/library"
A,1289,1,"Master Chef Product Information","/masterchef"
A,1297,1,"Central America","/centroam"
A,1215,1,"For Developers Only Info","/developer"
A,1279,1,"Multimedia Golf","/msgolf"
A,1239,1,"Microsoft Consulting","/msconsult"
A,1282,1,"home","/home"
A,1251,1,"Reference Support","/referencesupport"
A,1121,1,"Microsoft Magazine","/magazine"
A,1083,1,"MS Access Support","/msaccesssupport"
A,1145,1,"Visual Fox Pro Support","/vfoxprosupport"
A,1276,1,"Visual Test Support","/vtestsupport"
A,1200,1,"Benelux Region","/benelux"
A,1259,1,"controls","/controls"
A,1155,1,"Sidewalk","/sidewalk"
A,1092,1,"Visual FoxPro","/vfoxpro"
A,1004,1,"Microsoft.com Search","/search"
A,1057,1,"MS PowerPoint News","/powerpoint"
A,1140,1,"Netherlands (Holland)","/netherlands"
A,1198,1,"Pi

<b>HW 4.3</b> 

Find the 5 most frequently visited pages using mrjob from the output of 4.2 (i.e., transfromed log file).

In [271]:
%%writefile visits_per_page.py

from mrjob.job import MRJob
from mrjob.step import MRStep
from operator import itemgetter

import csv

def csv_readline(line):
    # Parse line
    for row in csv.reader([line]):
        return row

class PageVisit(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_pviews,
                     combiner=self.combiner_sum_pviews,
                     reducer=self.reducer_sum_pviews),
            MRStep(reducer=self.reducer_rank_pviews)
        ]

    def mapper_count_pviews(self, line_no, line):
        # For vote lines, extract customer information
        cell = csv_readline(line)
        if cell[0] == 'V':
            yield cell[1],1

    def combiner_sum_pviews(self, page_id, visit_counts):
        # Add page visits together
        yield page_id, sum(visit_counts)

    def reducer_sum_pviews(self, page_id, visit_counts):
        # Add page visits together
        yield None, (sum(visit_counts), page_id)
#        yield page_id, sum(visit_counts)
        
    def reducer_rank_pviews(self, _, visit_counts):
        # Find top five
        top_five = sorted(visit_counts, key=lambda k: -k[0])[:5]
        # Print each of the top 5 nicely
        for i, result in enumerate(top_five):
            yield "Top {}:".format(i+1), "Page {} with {} visits".format(result[1], result[0])

if __name__ == '__main__':
    PageVisit.run()

Overwriting visits_per_page.py


In [427]:
# Find the visits per page
def HW4_3():
    from visits_per_page import PageVisit
    import csv

    mr_job = PageVisit(args=['parsed_msweb.data'])
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            print mr_job.parse_output_line(line)
            
HW4_3()



('Top 1:', 'Page 1008 with 10836 visits')
('Top 2:', 'Page 1034 with 9383 visits')
('Top 3:', 'Page 1004 with 8463 visits')
('Top 4:', 'Page 1018 with 5330 visits')
('Top 5:', 'Page 1017 with 5108 visits')


<b>HW4.4</b>


Find the most frequent visitor of each page using mrjob and the output of 4.2  (i.e., transfromed log file). In this output please include the webpage URL, webpageID and Visitor ID.


In [275]:
%%writefile freq_visitor.py

# Libraries
from mrjob.job import MRJob
from mrjob.step import MRStep
from operator import itemgetter

import csv

# Function to parse original CSV lines
def csv_readline(line):
    # Parse line
    for row in csv.reader([line]):
        return row

class FreqVisitor(MRJob):
    # Run two MapReduce steps
    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_pviews,
                     combiner=self.reducer_sum_pviews,
                     reducer=self.reducer_sum_pviews),
            MRStep(mapper=self.mapper_rank_pviews,
                  reducer=self.reducer_rank_pviews)
        ]

    # First, create counts of page-customer combos. Also save the web address
    # for later use
    def mapper_count_pviews(self, line_no, line):
        # For vote lines, extract customer information
        cell = csv_readline(line)
        if cell[0] == 'V':
            yield (cell[1], cell[3]) , 1
        if cell[0] == 'A':
            yield (cell[1], "*"), 'www.microsoft.com'+cell[4]

    # Next, reduce to get sums of page-customer combos. Pass web addresses
    # through for later use
    def reducer_sum_pviews(self, page_cust, visit_counts):
        # Add page visits together
        if page_cust[1] != "*":
            yield page_cust, sum(visit_counts)
        else:
            for x in visit_counts:
                yield page_cust, x
        
    # In second pass, adjust keys to properly sort web addresses with count
    # totals
    def mapper_rank_pviews(self, page_cust, visit_counts):
        yield page_cust[0], (visit_counts, page_cust[1])
        
    # Create final entries of max page and web address
    def reducer_rank_pviews(self, page_id, visit_counts):
        
        # Set defaults in case no data is present for a key
        max_count = 0
        cust_list = []
        url = ""
        
        # Iterate through values for each key
        for x in visit_counts:
            # If the entry is flagged as a URL with *, then save the URL
            if x[1] == "*":
                url = x[0]
            else:
                # Otherwise, if the customer has a new max number of page views,
                # replace the existing max customer with this new customer
                if x[0] > max_count:
                    max_count = x[0]
                    cust_list = [x[1]]
                # If there is a tie, append the tieing customer to the max customer
                # list
                elif x[0] == max_count:
                    max_count = x[0]
                    cust_list += [x[1]]
        # Yield key of page_id, value of results
        yield page_id, (url, max_count, cust_list)

if __name__ == '__main__':
    FreqVisitor.run()

Overwriting freq_visitor.py


In [428]:
# Find most frequent visitor for each page
def HW4_4():
    from freq_visitor import FreqVisitor
    import csv

    mr_job = FreqVisitor(args=['parsed_msweb.data','--strict-protocols'])
    with mr_job.make_runner() as runner, open('FreqVisitorOut.data', 'w') as outfile:
        runner.run()
        for line in runner.stream_output():
            outfile.write(str(mr_job.parse_output_line(line))+'\n')

    print "The first 50 results are:"
    !cat ./FreqVisitorOut.data | head -n50

    print
    print "The last 50 results are:"
    !cat ./FreqVisitorOut.data | tail -n50

HW4_4()

The first 50 results are:
('1000', ['www.microsoft.com/regwiz', 1, ['10001', '10010', '10039', '10073', '10087', '10101', '10132', '10141', '10154', '10162', '10166', '10201', '10218', '10220', '10324', '10348', '10376', '10384', '10409', '10429', '10454', '10457', '10471', '10497', '10511', '10520', '10541', '10564', '10599', '10752', '10756', '10861', '10935', '10943', '10969', '11027', '11050', '11410', '11429', '11440', '11490', '11501', '11528', '11539', '11544', '11685', '11695', '11723', '11766', '11774', '11779', '11898', '11964', '12017', '12020', '12035', '12086', '12123', '12143', '12155', '12201', '12220', '12228', '12262', '12273', '12306', '12315', '12324', '12337', '12343', '12400', '12415', '12484', '12485', '12537', '12571', '12583', '12674', '12700', '12740', '12815', '12853', '12893', '12897', '12930', '12944', '12970', '12982', '13015', '13049', '13079', '13080', '13085', '13128', '13176', '13197', '13223', '13248', '13275', '13294', '13322', '13342', '13365', '1341

<b>HW 4.5</b>

Here you will use a different dataset consisting of word-frequency distributions 
for 1,000 Twitter users. These Twitter users use language in very different ways,
and were classified by hand according to the criteria:

0: Human, where only basic human-human communication is observed.

1: Cyborg, where language is primarily borrowed from other sources
(e.g., jobs listings, classifieds postings, advertisements, etc...).

2: Robot, where language is formulaically derived from unrelated sources
(e.g., weather/seismology, police/fire event logs, etc...).

3: Spammer, where language is replicated to high multiplicity
(e.g., celebrity obsessions, personal promotion, etc... )

Check out the preprints of our recent research,
which spawned this dataset:

http://arxiv.org/abs/1505.04342 <br>
http://arxiv.org/abs/1508.01843

The main data lie in the accompanying file:

topUsers_Apr-Jul_2014_1000-words.txt

and are of the form:

USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,... <br>
.
.

where <br>

USERID = unique user identifier <br>
CODE = 0/1/2/3 class code <br>
TOTAL = sum of the word counts <br>

Using this data, you will implement a 1000-dimensional K-means algorithm on the users
by their 1000-dimensional word stripes/vectors using several 
centroid initializations and values of K.

Note that each "point" is a user as represented by 1000 words, and that
word-frequency distributions are generally heavy-tailed power-laws
(often called Zipf distributions), and are very rare in the larger class
of discrete, random distributions. For each user you will have to normalize
by its "TOTAL" column. Try several parameterizations and initializations:

(A) K=4 uniform random centroid-distributions over the 1000 words <br>
(B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution  <br>
(C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution  <br>
(D) K=4 "trained" centroids, determined by the sums across the classes. <br>

and iterate until a threshold (try 0.001) is reached.
After convergence, print out a summary of the classes present in each cluster.
In particular, report the composition as measured by the total
portion of each class type (0-3) contained in each cluster,
and discuss your findings and any differences in outcomes across parts A-D.

Note that you do not have to compute the aggregated distribution or the 
class-aggregated distributions, which are rows in the auxiliary file:

topUsers_Apr-Jul_2014_1000-words_summaries.txt

<span style="color:green"><b>Answer:</b></span>
Our solution has three different MrJob codes:
1. The first job code runs a quick EDA to find out the number of individuals in each class in the total dataset. This provides context when evaluating the size of our final clusters
2. The second job code is a modified version of the KMeans code provided in class that can run in 1000 dimensions and deal with other nits of this data (normalization, etc.)
3. The third job code uses the centroids of the second job to classify and report the number of each twitter tag that was sorted into each centroid bucket.

Our EDA shows us that there are 751 humans, 91 cyborgs, 54 robots, and 103 spammers. Thus, we should not be surprised if our final clusters has one larger than the other (ideally, we would come close to matching these figures without overfitting)

As an end result, it looks at a glance like method D is working the best. This makes intuitive sense, because we use class-specific information to create the starting centroids in part D.

#### EDA

In [424]:
%%writefile Kmeans_EDA.py
from numpy import argmin, array, random
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain



class MRKmeansEDA(MRJob):
    centroid_points=[]
    def steps(self):
        return [
            MRStep(
                   mapper=self.mapper,
                   combiner = self.reducer,
                   reducer=self.reducer)
               ]
        
    #load data and output the nearest centroid index and data point 
    def mapper(self, _, line):
        # Note - adjust so that mapper ignores first few values
        value = (map(float,line.split(',')))[1]
        yield value, 1
        
    #Reduce sum of data points 
    def reducer(self, idx, inputdata):
        yield idx, sum(inputdata)

if __name__ == '__main__':
    MRKmeansEDA.run()

Writing Kmeans_EDA.py


In [425]:
import numpy as np
from Kmeans_EDA import MRKmeansEDA

# Now print predictions
mr_job = MRKmeansEDA(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        key,value =  mr_job.parse_output_line(line)
        print key, value


0.0 752
1.0 91
2.0 54
3.0 103


#### KMeans

In [419]:
%%writefile Kmeans.py
from numpy import argmin, array, random
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain

#Calculate find the nearest centroid for data point 
def MinDist(datapoint, centroid_points):
    datapoint = array(datapoint)
    centroid_points = array(centroid_points)
    diff = datapoint - centroid_points 
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = argmin(list(diffsq.sum(axis = 1)))
    return minidx

#Check whether centroids converge
def stop_criterion(centroid_points_old, centroid_points_new,T):
    oldvalue = list(chain(*centroid_points_old))
    newvalue = list(chain(*centroid_points_new))
    Diff = [abs(x-y) for x, y in zip(oldvalue, newvalue)]
    Flag = True
    for i in Diff:
        if(i>T):
            Flag = False
            break
    return Flag

class MRKmeans(MRJob):
    centroid_points=[]
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper,
                   combiner = self.combiner,
                   reducer=self.reducer)
               ]
    #load centroids info from file
    def mapper_init(self):
        self.centroid_points = [map(float,s.split('\n')[0].split(',')) for s in open('/tmp/Centroids.txt').readlines()]
        open('/tmp/Centroids.txt', 'w').close()
        
    #load data and output the nearest centroid index and data point 
    def mapper(self, _, line):
        # Note - adjust so that mapper ignores first few values
        D = (map(float,line.split(',')))[3:]
        tot_len = (map(float,line.split(',')))[2]
        D = [x / tot_len for x in D]
        yield int(MinDist(D,self.centroid_points)), (D,1)
        
    #Combine sum of data points locally
    def combiner(self, idx, inputdata):
        num = 0
        sumx = []
        for x,n in inputdata:
            num = num + n
            if sumx == []:
                sumx = x
            else:
                sumx = [a + b for a, b in zip(sumx, x)]
        yield idx,(sumx,num)
        
    #Aggregate sum for each cluster and then calculate the new centroids
    def reducer(self, idx, inputdata): 
        num = 0
        sumx = []
        for x,n in inputdata:
            num = num + n
            if sumx == []:
                sumx = x
            else:
                sumx = [a + b for a, b in zip(sumx, x)]
                
        sumx = [x / num for x in sumx]
        with open('/tmp/Centroids.txt', 'a') as f:
            f.writelines(",".join( repr(a) for a in sumx ) + '\n')
        yield idx,sumx

if __name__ == '__main__':
    MRKmeans.run()

Overwriting Kmeans.py


In [420]:
%%writefile Kmeans_Accuracy.py
from numpy import argmin, array, random
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain

#Calculate find the nearest centroid for data point 
def MinDist(datapoint, centroid_points):
    datapoint = array(datapoint)
    centroid_points = array(centroid_points)
    diff = datapoint - centroid_points 
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = argmin(list(diffsq.sum(axis = 1)))
    return minidx


class MRKmeansAccuracy(MRJob):
    centroid_points=[]
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper,
                   combiner = self.combiner,
                   reducer=self.reducer)
               ]
    #load centroids info from file
    def mapper_init(self):
        self.centroid_points = [map(float,s.split('\n')[0].split(',')) for s in open('/tmp/Centroids.txt').readlines()]
        open('/tmp/Centroids.txt', 'w').close()
        
    #load data and output the nearest centroid index and data point 
    def mapper(self, _, line):
        # This time, we use the mapper to get the final mappings of each node
        D = (map(float,line.split(',')))[3:]
        # Yeild keys of (cluster_ID, true_ID)
        yield (int(MinDist(D,self.centroid_points)), (map(float,line.split(',')))[1]), 1
        
    #Combine sum of data points locally
    def combiner(self, idx, inputdata):
        yield idx, sum(inputdata)
        
    #Aggregate sum for each cluster and then output classifications
    def reducer(self, idx, inputdata): 
        yield "Cluster {}, Twitter-Type {}".format(idx[0], idx[1]), sum(inputdata)

if __name__ == '__main__':
    MRKmeansAccuracy.run()

Overwriting Kmeans_Accuracy.py


In [481]:
# LETS MAKE SOME CENTROIDS!!!!
# Launch 4 different versions of KMeans. I'm really proud of this...

def HW4_5():
    import numpy as np
    from Kmeans import MRKmeans, stop_criterion
    from Kmeans_Accuracy import MRKmeansAccuracy
    mr_job = MRKmeans(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])

    # Set convergence threshold
    converge_thresh = 0.001


    #(A) K=4 uniform random centroid-distributions over the 1000 words 
    print "Starting Part A....."
    k = 4
    random_centroids = np.random.uniform(0,1,(k,1000))
    centroids_sums = np.sum(random_centroids, axis=1)
    normalized_centroids = np.divide(random_centroids, centroids_sums[:,np.newaxis])

    np.savetxt('/tmp/Centroids.txt', normalized_centroids, 
               fmt='%.16f', delimiter=',', newline='\n')
    centroid_points = normalized_centroids.tolist()

    # Update centroids iteratively
    i = 0
    while(1):
        # save previous centoids to check convergency
        centroid_points_old = centroid_points[:]
        print "iteration"+str(i)
        with mr_job.make_runner() as runner: 
            runner.run()
            # stream_output: get access of the output 
            for line in runner.stream_output():
                key,value =  mr_job.parse_output_line(line)
                #print key, value
                centroid_points[key] = value
        i = i + 1
        if(stop_criterion(centroid_points_old,centroid_points,converge_thresh)):
            break
    print "Ran {} iterations\n".format(i)

    # Now print predictions
    mr_job = MRKmeansAccuracy(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])
    with mr_job.make_runner() as runner: 
        runner.run()
        # stream_output: get access of the output 
        for line in runner.stream_output():
            key,value =  mr_job.parse_output_line(line)
            print key, value


    ####################
    #(B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 
    print
    print "Starting Part B....."
    mr_job = MRKmeans(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])

    with open("topUsers_Apr-Jul_2014_1000-words_summaries.txt") as infile:
        filedata = infile.readlines()
        totalrow = filedata[1].split(',')

        centroids1 = [((float(x)/ float(totalrow[2])) +np.random.uniform(0,1)*0.001) for x in totalrow[3:]] 
        centroids2 = [((float(x)/ float(totalrow[2])) +np.random.uniform(0,1)*0.001) for x in totalrow[3:]] 
        centroids3 = [((float(x)/ float(totalrow[2])) +np.random.uniform(0,1)*0.001) for x in totalrow[3:]] 
        centroids4 = [((float(x)/ float(totalrow[2])) +np.random.uniform(0,1)*0.001) for x in totalrow[3:]] 

    centroid_points = [centroids1, centroids2]    
    with open('/tmp/Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

    # Update centroids iteratively
    i = 0
    while(1):
        # save previous centoids to check convergency
        centroid_points_old = centroid_points[:]
        print "iteration"+str(i)
        with mr_job.make_runner() as runner: 
            runner.run()
            # stream_output: get access of the output 
            for line in runner.stream_output():
                key,value =  mr_job.parse_output_line(line)
                #print key, value
                centroid_points[key] = value
        i = i + 1
        if(stop_criterion(centroid_points_old,centroid_points,converge_thresh)):
            break
    print "Ran {} iterations\n".format(i)

    # Now print predictions
    mr_job = MRKmeansAccuracy(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])
    with mr_job.make_runner() as runner: 
        runner.run()
        # stream_output: get access of the output 
        for line in runner.stream_output():
            key,value =  mr_job.parse_output_line(line)
            print key, value

    ####################
    #(C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 
    print
    print "Starting Part C....."
    mr_job = MRKmeans(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])

    centroid_points = [centroids1, centroids2, centroids3, centroids4]    
    with open('/tmp/Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

    # Update centroids iteratively
    i = 0
    while(1):
        # save previous centoids to check convergency
        centroid_points_old = centroid_points[:]
        print "iteration"+str(i)
        with mr_job.make_runner() as runner: 
            runner.run()
            # stream_output: get access of the output 
            for line in runner.stream_output():
                key,value =  mr_job.parse_output_line(line)
                #print key, value
                centroid_points[key] = value
        i = i + 1
        if(stop_criterion(centroid_points_old,centroid_points,converge_thresh)):
            break
    print "Ran {} iterations\n".format(i)

    # Now print predictions
    mr_job = MRKmeansAccuracy(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])
    with mr_job.make_runner() as runner: 
        runner.run()
        # stream_output: get access of the output 
        for line in runner.stream_output():
            key,value =  mr_job.parse_output_line(line)
            print key, value


    ####################
    #(D) K=4 "trained" centroids, determined by the sums across the classes. 
    print
    print "Starting Part D....."
    mr_job = MRKmeans(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])

    # Load centroid data
    with open("topUsers_Apr-Jul_2014_1000-words_summaries.txt") as infile:
        filedata = infile.readlines()
        datarow1 = filedata[2].split(',')
        datarow2 = filedata[3].split(',')
        datarow3 = filedata[4].split(',')
        datarow4 = filedata[5].split(',')

        centroids1 = [((float(x)/ float(datarow1[2]))) for x in datarow1[3:]] 
        centroids2 = [((float(x)/ float(datarow2[2]))) for x in datarow2[3:]] 
        centroids3 = [((float(x)/ float(datarow3[2]))) for x in datarow3[3:]] 
        centroids4 = [((float(x)/ float(datarow4[2]))) for x in datarow4[3:]] 

    centroid_points = [centroids1, centroids2, centroids3, centroids4]    
    with open('/tmp/Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

    # Update centroids iteratively
    i = 0
    while(1):
        # save previous centoids to check convergency
        centroid_points_old = centroid_points[:]
        print "iteration"+str(i)
        with mr_job.make_runner() as runner: 
            runner.run()
            # stream_output: get access of the output 
            for line in runner.stream_output():
                key,value =  mr_job.parse_output_line(line)
                #print key, value
                centroid_points[key] = value
        i = i + 1
        if(stop_criterion(centroid_points_old,centroid_points,converge_thresh)):
            break
    print "Ran {} iterations\n".format(i)

    # Now print predictions
    mr_job = MRKmeansAccuracy(args=['topUsers_Apr-Jul_2014_1000-words.txt','--strict-protocols'])
    with mr_job.make_runner() as runner: 
        runner.run()
        # stream_output: get access of the output 
        for line in runner.stream_output():
            key,value =  mr_job.parse_output_line(line)
            print key, value
            
HW4_5()


Starting Part A.....
iteration0
iteration1
iteration2
iteration3
iteration4
iteration5
iteration6
iteration7
iteration8
iteration9
Ran 10 iterations

Cluster 0, Twitter-Type 0.0 66
Cluster 0, Twitter-Type 1.0 40
Cluster 0, Twitter-Type 2.0 40
Cluster 0, Twitter-Type 3.0 50
Cluster 1, Twitter-Type 1.0 51
Cluster 2, Twitter-Type 2.0 13
Cluster 3, Twitter-Type 0.0 686
Cluster 3, Twitter-Type 2.0 1
Cluster 3, Twitter-Type 3.0 53

Starting Part B.....
iteration0
iteration1
iteration2
iteration3
iteration4
Ran 5 iterations

Cluster 0, Twitter-Type 0.0 727
Cluster 0, Twitter-Type 1.0 2
Cluster 0, Twitter-Type 2.0 3
Cluster 0, Twitter-Type 3.0 77
Cluster 1, Twitter-Type 0.0 25
Cluster 1, Twitter-Type 1.0 89
Cluster 1, Twitter-Type 2.0 51
Cluster 1, Twitter-Type 3.0 26

Starting Part C.....
iteration0
iteration1
iteration2
iteration3
Ran 4 iterations

Cluster 0, Twitter-Type 1.0 51
Cluster 0, Twitter-Type 2.0 5
Cluster 1, Twitter-Type 0.0 54
Cluster 1, Twitter-Type 1.0 40
Cluster 1, Twitter-Typ

This concludes HW 4.0. Thanks for reading!