>### HW 9.0: Short answer questions

>What is PageRank and what is it used for in the context of web search?  
>What modifications have to be made to the webgraph in order to leverage the machinery of Markov Chains to 
compute the steady stade distibuton?  
OPTIONAL: In topic-specific pagerank, how can we insure that the irreducible property is satified? (HINT: see HW9.4)



>### HW 9.1: MRJob implementation of basic PageRank

>Write a basic MRJob implementation of the iterative PageRank algorithm
that takes sparse adjacency lists as input (as explored in HW 7).
Make sure that you implementation utilizes teleportation (1-damping/the number of nodes in the network), 
and further, distributes the mass of dangling nodes with each iteration
so that the output of each iteration is correctly normalized (sums to 1).
[NOTE: The PageRank algorithm assumes that a random surfer (walker), starting from a random web page,
chooses the next page to which it will move by clicking at random, with probability d,
one of the hyperlinks in the current page. This probability is represented by a so-called
‘damping factor’ d, where d ∈ (0, 1). Otherwise, with probability (1 − d), the surfer
jumps to any web page in the network. If a page is a dangling end, meaning it has no
outgoing hyperlinks, the random surfer selects an arbitrary web page from a uniform
distribution and “teleports” to that page]


>As you build your code, use the test data  

>s3://ucb-mids-mls-networks/PageRank-test.txt
Or under the Data Subfolder for HW7 on Dropbox with the same file name. 
(On Dropbox https://www.dropbox.com/sh/2c0k5adwz36lkcw/AAAAKsjQfF9uHfv-X9mCqr9wa?dl=0)

>with teleportation parameter set to 0.15 (1-d, where d, the damping factor is set to 0.85), and crosscheck
your work with the true result, displayed in the first image
in the Wikipedia article:

>https://en.wikipedia.org/wiki/PageRank

>and here for reference are the corresponding PageRank probabilities:

>A,0.033  
B,0.384  
C,0.343  
D,0.039  
E,0.081  
F,0.039  
G,0.016  
H,0.016  
I,0.016  
J,0.016  
K,0.016  

In [62]:
%%writefile mrpagerank.py
from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol
from mrjob.step import MRStep

class MRPageRank(MRJob):
    
    INPUT_PROTOCOL = JSONProtocol  # read the same format we write
    OUTPUT_PROTOCOL = JSONProtocol
    
    def configure_options(self):
        super(MRPageRank, self).configure_options()
        
        self.add_passthrough_option(
            '--iterations', dest='iterations', default=10, type='int',
            help='number of iterations to run')

        self.add_passthrough_option(
            '--damping-factor', dest='damping_factor', default=0.85,
            type='float',
            help='probability a web surfer will continue clicking on links')
    
    def send_score(self, node_id, node):
        """Mapper: send score from a single node to other nodes.
        Input: ``node_id, node``
        Output:
        ``node_id, ('node', node)`` OR
        ``node_id, ('score', score)``
        """
        yield node_id, ('node', node)

        for dest_id, weight in node.get('links') or []:
            yield dest_id, ('score', node['score'] * weight)
    
    def receive_score(self, node_id, typed_values):
        """Reducer: Combine scores sent from other nodes, and update this node
        (creating it if necessary).
        Store information about the node's previous score in *prev_score*.
        """
        node = {}
        total_score = 0
        
        for value_type, value in typed_values:
            if value_type == 'node':
                node = value
            else:
                assert value_type == 'score'
                total_score += value
        
        if node:
            node['prev_score'] = node['score']
        else: 
            links = {}
            node['score'] = 1
            node['prev_score'] = 1

        d = self.options.damping_factor
        node['score'] = 1 - d + d * total_score
        
        yield node_id, node
    
    def steps(self):
        return ([MRStep(mapper=self.send_score, reducer=self.receive_score)] *
                self.options.iterations)

if __name__ == '__main__':
    MRPageRank.run()

Overwriting mrpagerank.py


In [63]:
%%writefile format_nodes.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONProtocol, RawProtocol

class FormatNodes(MRJob):
    OUTPUT_PROTOCOL = JSONProtocol
    INPUT_PROTOCOL = RawProtocol
    
    def mapper(self, node_id, links):
        links = eval(links)
        node = {}
        if links is not None:
            node['links'] = sorted(links.items())
        else:
            node['links'] = {}

        node['score'] = 1
        for link in node['links']:
            yield _, (link, {})
        
        yield node_id, node
    
    def reducer_init(self):
        self.total_count = 0
    
    def reducer(self, node_id, nodelist):
        self.total_count += 1
        links = {}
        score = 0
        prev_score = 0
        for node in nodelist:
            node_links = nodelist['links']
            links.append(node_links)
            score = max(score, nodelist['score'])
            prev_score = max(score, nodelist['score'])
        yield node_id, links
        
    
    def reducer_final
    
    def steps(self):
        return ([MRStep(mapper=self.mapper)])

if __name__ == '__main__':
    FormatNodes.run()

Overwriting format_nodes.py


In [68]:
!head ./Data/PageRank-test.txt

B	{'C': 1}
C	{'B': 1}
D	{'A': 1, 'B': 1}
E	{'D': 1, 'B': 1, 'F': 1}
F	{'B': 1, 'E': 1}
G	{'B': 1, 'E': 1}
H	{'B': 1, 'E': 1}
I	{'B': 1, 'E': 1}
J	{'E': 1}
K	{'E': 1}


In [64]:
!python format_nodes.py ./Data/PageRank-test.txt > pg_formatted.json

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/format_nodes.bshur.20160310.182344.485298

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/format_nodes.bshur.20160310.182344.485298/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
Moving /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/format_nodes.bshur.20160310.182344.485298/step-0-mapper_part-00000 -> /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/format_nodes.bshur.20160310.182344.485298/output/part-00000
Streaming final output from /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/format_nodes.bshur.20160310.182344.485298

In [66]:
!python mrpagerank.py --iterations=50 pg_formatted.json

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/mrpagerank.bshur.20160310.182415.688646

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/mrpagerank.bshur.20160310.182415.688646/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/mrpagerank.bshur.20160310.182415.688646/step-0-mapper-sorted
> sort /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/mrpagerank.bshur.20160310.182415.688646/step-0-mapper_part-00000
writing to /var/folders/f8/70g84lyd3n387wjwhbfq4dp80000gn/T/mrpagerank.bshur.20160310.182415.688646/step-0-reducer_part-00000