We created the data by sampling and processing the www.microsoft.com logs. The data records the use of www.microsoft.com by 38000 anonymous, randomly selected users. For each user, the data lists all the areas of the web site (Vroots) that the user visited in a one-week timeframe.

Users are identified only by a sequential number, for example, User #14988, User #14989, etc. The file contains no personally identifiable information. The 294 Vroots are identified by their title (e.g. "NetShow for PowerPoint") and URL (e.g. "/stream"). The data comes from one week in February, 1998.

Dataset format:

  -- The data is in an ASCII-based sparse-data format called "DST".
     Each line of the data file starts with a letter which tells the line's type.
     The three line types of interest are:
         -- Attribute lines:
             For example, 'A,1277,1,"NetShow for PowerPoint","/stream"'
             Where:
               'A' marks this as an attribute line,
               '1277' is the attribute ID number for an area of the website
                     (called a Vroot),
               '1' may be ignored,
               '"NetShow for PowerPoint"' is the title of the Vroot,
               '"/stream"' is the URL relative to "http://www.microsoft.com"
         -- Case and Vote Lines:
             For each user, there is a case line followed by zero or more vote lines.
              For example:
                  C,"10164",10164
                  V,1123,1
                  V,1009,1
                  V,1052,1
              Where:
                  'C' marks this as a case line,
                   '10164' is the case ID number of a user,
                  'V' marks the vote lines for this case,
                  '1123', '1009', '1052' are the attribute IDs of Vroots that a
                       user visited.
                   '1' may be ignored.
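
The line types described above can be read with Python's standard csv module. The following is a minimal sketch, not part of the original notebook: the function name parse_dst and the returned dictionaries are illustrative assumptions, and the sample lines are the ones quoted in the format description.

```python
import csv

def parse_dst(lines):
    """Split DST lines into Vroot attributes and per-user visit lists.

    parse_dst is a hypothetical helper written for this illustration.
    """
    attributes = {}      # attribute ID -> (title, URL)
    cases = {}           # case ID -> list of visited attribute IDs
    current_case = None
    for row in csv.reader(lines):
        if not row:
            continue     # skip blank lines
        kind = row[0]
        if kind == 'A':
            # A,<attribute ID>,<ignored>,<title>,<URL>
            attributes[row[1]] = (row[3], row[4])
        elif kind == 'C':
            # C,<case ID>,<case ID>; starts a new user
            current_case = row[1]
            cases[current_case] = []
        elif kind == 'V' and current_case is not None:
            # V,<attribute ID>,<ignored>; belongs to the most recent case
            cases[current_case].append(row[1])
        # any other line type is not of interest and is skipped
    return attributes, cases

sample = [
    'A,1277,1,"NetShow for PowerPoint","/stream"',
    'C,"10164",10164',
    'V,1123,1',
    'V,1009,1',
    'V,1052,1',
]
attributes, cases = parse_dst(sample)
print(attributes)
print(cases)
```

Note that the csv module returns every field as a string, so the IDs are kept as strings here, just as they are in the mapper further down.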

In [2]:
%%writefile top_pages.py
"""Find Vroots with more than 400 visits.

This program will take a CSV data file and output tab-separated lines of

    Vroot -> number of visits

To run:

    python top_pages.py anonymous-msweb.data

To store output:

    python top_pages.py anonymous-msweb.data > top_pages.out
"""
from mrjob.job import MRJob
import csv

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

class TopPages(MRJob):

    def mapper(self, line_no, line):
        """Extracts the Vroot that was visited"""
        cell = csv_readline(line)
        if cell and cell[0] == 'V':
            # Key: the Vroot ID; value: a count of 1 for this visit.
            yield cell[1], 1

    def reducer(self, vroot, visit_counts):
        """Sums the visit counts for one Vroot.  If the total is more than
        400, yields the Vroot and its total."""
        total = sum(visit_counts)
        if total > 400:
            yield vroot, total
        
if __name__ == '__main__':
    TopPages.run()

Overwriting top_pages.py
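
To make the map, sort, and reduce phases of the job above easier to follow, here is a rough single-process sketch that uses only the standard library. It is not how mrjob runs the job internally. The names map_line and count_visits, the min_visits parameter, and the second user in the sample are assumptions made for this illustration.

```python
import csv
from itertools import groupby
from operator import itemgetter

def map_line(line):
    """Emit (vroot_id, 1) for a vote line; other line types emit nothing."""
    row = next(csv.reader([line]), [])
    if row and row[0] == 'V':
        yield row[1], 1

def count_visits(lines, min_visits=400):
    """Hypothetical local stand-in for the map, sort, and reduce phases."""
    # Map phase: turn every input line into zero or more (key, value) pairs.
    pairs = [pair for line in lines for pair in map_line(line)]
    # Sort phase: bring pairs with the same key next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: sum the values for each key and keep the large totals.
    results = []
    for vroot, group in groupby(pairs, key=itemgetter(0)):
        total = sum(count for _, count in group)
        if total > min_visits:
            results.append((vroot, total))
    return results

sample = [
    'C,"10164",10164',
    'V,1123,1',
    'V,1009,1',
    'V,1052,1',
    'C,"10165",10165',   # hypothetical second user, not from the data set
    'V,1009,1',
]
# A threshold of 1 is used only because the sample is tiny.
print(count_visits(sample, min_visits=1))
```

Only Vroot 1009 is visited more than once in this sample, so it is the only key that passes the threshold.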


In [3]:
!chmod a+x top_pages.py

In [4]:
!python top_pages.py anonymous-msweb.data

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/top_pages.rcordell.20160202.224559.802192

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/top_pages.rcordell.20160202.224559.802192/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/top_pages.rcordell.20160202.224559.802192/step-0-mapper-sorted
> sort /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/top_pages.rcordell.20160202.224559.802192/step-0-mapper_part-00000
writing to /var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/top_pages.rcordell.20160202.224559.802192/step-0-reducer_

In [5]:
from top_pages import TopPages
import csv

mr_job = TopPages(args=['anonymous-msweb.data'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        print(mr_job.parse_output_line(line))



('1000', 912)
('1001', 4451)
('1002', 749)
('1003', 2968)
('1004', 8463)
('1007', 865)
('1008', 10836)
('1009', 4628)
('1010', 698)
('1014', 728)
('1017', 5108)
('1018', 5330)
('1020', 1087)
('1024', 521)
('1025', 2123)
('1026', 3220)
('1027', 507)
('1030', 1115)
('1031', 574)
('1032', 1446)
('1034', 9383)
('1035', 1791)
('1036', 759)
('1037', 1160)
('1038', 1110)
('1040', 1506)
('1041', 1500)
('1045', 474)
('1046', 636)
('1052', 842)
('1053', 670)
('1058', 672)
('1067', 548)
('1070', 602)
('1074', 584)
('1076', 444)
('1078', 462)
('1295', 716)
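
The output above is ordered by Vroot ID rather than by visit count. As a small sketch, the pairs can be ranked by count with sorted. The list below copies only five of the pairs printed above, and the variable names are illustrative.

```python
# A subset of the (vroot_id, visits) pairs printed by the job above.
results = [
    ('1004', 8463),
    ('1008', 10836),
    ('1017', 5108),
    ('1018', 5330),
    ('1034', 9383),
]

# Sort by the visit count (second element), largest first.
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]
for vroot, visits in top3:
    print(vroot, visits)
```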
