## Homework 9
Ron Cordell, Ted Dunmire, Filip Krunic 

-----

### Question 9.1 

*Write a basic MRJob implementation of the iterative PageRank algorithm
that takes sparse adjacency lists as input (as explored in HW 7).*

*Make sure that your implementation utilizes teleportation (1-damping/the number of nodes in the network), and further, distributes the mass of dangling nodes with each iteration
so that the output of each iteration is correctly normalized (sums to 1).*

#### Solution

In the spirit of the previous homeworks, we implement a `driver` and separate MRJob modules for each part of the calculation. The driver manages the overall process: it first counts the nodes, then passes that count to the PageRank job, which computes the PageRank values for the dataset.

##### Driver 

Below is the driver for the PageRank process. 

In [1]:
%%writefile driver.py
from __future__ import division

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.emr import EMRJobRunner

from init_pr import pageRank
from number_of_nodes import numNodes

import cPickle as pickle
from collections import defaultdict
from operator import itemgetter
import argparse 

# Storage files 
s3Bucket = 's3://ucb-mids-mls-networks/wikipedia/all-pages-indexed-out.txt'


def getName(obj, namespace):
	return [name for name in namespace if namespace[name] is obj]


def extractValues(job, runner):
	output = defaultdict(int)
	for line in runner.stream_output(): 
		key, value = job.parse_output_line(line)
		output[key] = value

	return output 


def dumpToFile(variable, filename):
	with open(filename, 'w') as f: 
		pickle.dump(variable, f)


def runJob(method, args, emr=False):
	job = method(args=args)

	methodName = getName(method, globals())[0]
	print '\n\t' + 'Running ' + methodName + '...'

	with job.make_runner() as runner: 

		runner.run()
		result = extractValues(job, runner)

		print '\t' + 'Complete: ' + methodName	

		return result 


if __name__ == '__main__':

	# Arguments 
	parser = argparse.ArgumentParser(description='Driver for PageRank in MRJob.')

	parser.add_argument('--emr', default=None, action='store_true',
					   help='Flag for using the EMR (and S3 bucket).')

	parser.add_argument('--iterations', default=10, 
							help='Number of iterations to use for PageRank.')

	parser.add_argument('--file', default='PageRank-test.txt', 
							help='File to be passed to the PageRank class.')

	args = parser.parse_args()

	# Pass 
	if args.emr: 
		argInput = [s3Bucket, '-r', 'emr']

	else: 
		argInput = [args.file]


	# Get node counts 
	totalNodeTuple = runJob(numNodes, args=argInput)
	totalNodes = totalNodeTuple.values()[0]
	
	print 'NUMNODES: %s' % (totalNodes)

	# Execute 
	topNodes = runJob(pageRank, args=argInput + ['--numberOfNodes=%s' % (totalNodes)] + \
					['--iterations=%s' % (args.iterations)])

	nodeTuples = [(key, round(value, 5)) for (key, value) in topNodes.iteritems()]
	sortedNodes = sorted(nodeTuples, key=itemgetter(1), reverse=True)

	# Emit 
	for k, v in sortedNodes:
		print 'ID: %s \t PR: %s' % (k, v)

Overwriting driver.py


##### Number of Nodes 

Below is the class that counts the nodes. The count must include nodes that appear only as neighbors, because PageRank uses it both for the initial value `1/N` and for the teleportation term; an inaccurate count skews the final values.

In [2]:
%%writefile number_of_nodes.py
from __future__ import division
 
from mrjob.job import MRJob
from mrjob.step import MRStep
 
from collections import defaultdict
from operator import itemgetter

class numNodes(MRJob):

	""" Counts the number of nodes. """

	def mapper(self, _, line):

		""" Emit raw nodes. """

		# Parse 
		line = line.split('\t')
		node = line[0]
		adjacencyList = eval(line[1])

		# Track 
		for neighbor in adjacencyList.keys(): 

		   # Emit raw nodes
		   yield neighbor, None


		# Pass values
		yield node, None


	def reducer_duplicates(self, node, _):

		""" Emit count for each unique node. """

		yield None, 1 


	def reducer_aggregate(self, _, totalNodes):

		""" Aggregate counts. """

		yield None, sum(totalNodes)


	def steps(self):

		return [MRStep(mapper=self.mapper, 
						reducer=self.reducer_duplicates), 

				MRStep(reducer=self.reducer_aggregate)]


if __name__ == '__main__':
	numNodes().run()				

Writing number_of_nodes.py
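
As a local sanity check on what the two-step job computes, the same count can be reproduced in memory. The sketch below assumes each input line has the form `node<TAB>{neighbor: weight, ...}`, as parsed by the mapper above. `count_nodes` and the sample lines are illustrative, and `ast.literal_eval` is swapped in for `eval` so that only Python literals are parsed.

```python
import ast


def count_nodes(lines):
    """Count distinct nodes, including those that appear only as neighbors."""
    seen = set()
    for line in lines:
        node, raw_adjacency = line.rstrip('\n').split('\t')
        # literal_eval accepts only Python literals, unlike eval
        adjacency = ast.literal_eval(raw_adjacency)
        seen.add(node)
        seen.update(adjacency.keys())
    return len(seen)


# Hypothetical input: C never starts a line but is still counted
sample = [
    "A\t{'B': 1, 'C': 1}",
    "B\t{'A': 1}",
    "D\t{'A': 1}",
]
total = count_nodes(sample)  # A, B, C and D
```

This only works when the whole graph fits in memory; the MRJob version exists precisely because the Wikipedia graph does not.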


##### PageRank 

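Below is the implementation of PageRank. The driver passes the exact node count through `--numberOfNodes` (11 for the test dataset). If that option is omitted, the job falls back to an order-of-magnitude estimate of `10^manualPower` nodes (`10^7` by default), which is intended for EMR jobs.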

In [3]:
%%writefile init_pr.py
from __future__ import division
 
from mrjob.job import MRJob
from mrjob.step import MRStep
 
from collections import defaultdict
from operator import itemgetter

class pageRank(MRJob):
 
     """ This class implements the page-rank calculation. """


     def configure_options(self):

          """ Load options for the class. """

          super(pageRank, self).configure_options()

          self.add_passthrough_option('--alpha',
               default=0.85, type=float, help='alpha: Damping factor for teleportation in PageRank')

          self.add_passthrough_option('--iterations',
               default=10, type=int, help='iterations: number of iterations for PageRank')

          self.add_passthrough_option('--manualPower', 
               default=7, type=int, help='manualPower: order of magnitude for number of nodes.')

          self.add_passthrough_option('--numberOfNodes', 
               default=None, type=int, help='numberOfNodes: The number of nodes in your graph. Used for teleportation.')


     def load_options(self, args):

          """ Initializes the arguments for each class. """

          super(pageRank, self).load_options(args)

          self.alpha = self.options.alpha
          self.iterations = self.options.iterations

          # Check number of nodes 
          if self.options.numberOfNodes:
               self.numberOfNodes = self.options.numberOfNodes

          else:
               self.numberOfNodes = pow(10, self.options.manualPower)


     def mapper_init_pr(self, _, line):

          """ This initializes the PageRank algorithm by assembling the node list 
          for the initial PageRank values. """

          # Parse 
          line = line.split('\t')
          node = line[0]
          adjacencyList = eval(line[1])

          # Track 
          for neighbor in adjacencyList.keys(): 

               # Emit raw nodes
               yield neighbor, None


          # Pass values
          yield node, adjacencyList


     def reducer_init_pr(self, node, initTuple):

          """ This attaches initial PageRanks for the algorithm. """

          adjacencyList = dict()

          # Re-discover 
          for element in initTuple:
               if isinstance(element, dict):
                    adjacencyList = element 

          # Initialize PR
          PageRank = float(1) / float(self.numberOfNodes)

          # Emit
          yield node, (adjacencyList, PageRank)


     def mapper_iterate_pr(self, node, nodeTuple):

          """ This projects all of the PageRank weights for each node's neighbor. """

          adjacencyList, PageRank = nodeTuple

          if not adjacencyList:
               pass

          else: 

               # Emit PR 
               for neighbor in adjacencyList.keys(): 
                    yield neighbor, PageRank / len(adjacencyList)

          # Emit structure 
          yield node, adjacencyList


     def reducer_iterate_pr(self, node, PRNodeObject):

          """ This reconstructs the graph structure from the updated PageRanks. """

          updatedPR = 0
          # Default in case no structure record arrives for this node
          adjacencyList = dict()

          # Combine PR 
          for value in PRNodeObject:
               if isinstance(value, dict):
                    adjacencyList = value 

               else: 
                    updatedPR += value 

          # Damping factor 
          updatedPR = ((1 - self.alpha) / self.numberOfNodes) + self.alpha * updatedPR

          # Emit 
          yield node, (adjacencyList, updatedPR)


     def mapper_sort(self, node, nodeTuple):

          """ Emits the page rank for each node. """

          adjacencyList, PageRank = nodeTuple

          yield None, (node, PageRank)


     def reducer_sort(self, _, PageRankPair):

          """ Keeps the top 100 PageRank values. """

          sortedList = []

          # Iterate and remove 
          for node, score in PageRankPair:

               sortedList.append((node, score))
               sortedList = sorted(sortedList, key=itemgetter(1), reverse=True)

               if len(sortedList) > 100: 
                    sortedList.pop()

          # Emit 
          for node, score in sortedList: 
               yield node, score


     def steps(self):

          """ Determines the steps for the job. Has two phases- initiate PR and iterate. """

          initializeStep = [

               MRStep(mapper=self.mapper_init_pr, 
                         reducer=self.reducer_init_pr)

          ]

          iterateStep = [

               MRStep(mapper=self.mapper_iterate_pr, 
                         reducer=self.reducer_iterate_pr)         

          ]

          sortStep = [

               MRStep(mapper=self.mapper_sort, 
                         reducer=self.reducer_sort)

          ]

          return initializeStep + iterateStep * self.iterations + sortStep
 
 
if __name__ == '__main__':
               pageRank().run()                             

Writing init_pr.py
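
`reducer_sort` re-sorts its list on every incoming pair, which becomes costly when millions of nodes reach the single reducer. A possible alternative, sketched here outside MRJob, is `heapq.nlargest`, which keeps only the N best pairs while streaming. `top_n` is an illustrative name, not part of the job above; the sample scores are taken from the test run.

```python
import heapq
from operator import itemgetter


def top_n(pairs, n=100):
    """Return the n (node, score) pairs with the highest score, best first."""
    return heapq.nlargest(n, pairs, key=itemgetter(1))


# Works on any iterable, including a generator like the reducer's input
pairs = ((node, score) for node, score in
         [('A', 0.02765), ('B', 0.30362), ('C', 0.30973), ('E', 0.06821)])
best = top_n(pairs, n=2)
```

Inside the reducer, the same call could replace the append, sort, and pop loop, with the results yielded afterwards.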


##### Execute 

We now run the code on the test dataset.

In [4]:
!python driver.py --iterations=10


	Running numNodes...
	Complete: numNodes
NUMNODES: 11

	Running pageRank...
	Complete: pageRank
ID: C 	 PR: 0.30973
ID: B 	 PR: 0.30362
ID: E 	 PR: 0.06821
ID: D 	 PR: 0.03298
ID: F 	 PR: 0.03298
ID: A 	 PR: 0.02765
ID: G 	 PR: 0.01364
ID: I 	 PR: 0.01364
ID: H 	 PR: 0.01364
ID: K 	 PR: 0.01364
ID: J 	 PR: 0.01364


No handlers could be found for logger "mrjob.runner"
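
The eleven values printed above sum to about 0.84 rather than 1: `mapper_iterate_pr` emits no mass for nodes with an empty adjacency list, so that mass leaves the system at every iteration. One way to restore normalization, sketched in memory rather than as an MRJob step, is to total the dangling mass and spread it evenly over all nodes before damping. The graph below is a made-up four-node example, and `pagerank_step` is a hypothetical helper.

```python
def pagerank_step(graph, ranks, alpha=0.85):
    """One PageRank update with teleportation and dangling-mass redistribution.

    graph maps every node to a list of out-neighbors (empty if dangling).
    """
    n = len(graph)
    incoming = dict.fromkeys(graph, 0.0)
    dangling_mass = 0.0

    for node, neighbors in graph.items():
        if neighbors:
            share = ranks[node] / len(neighbors)
            for neighbor in neighbors:
                incoming[neighbor] += share
        else:
            # Mass with nowhere to go is collected and shared by every node
            dangling_mass += ranks[node]

    return {
        node: (1 - alpha) / n + alpha * (incoming[node] + dangling_mass / n)
        for node in graph
    }


# Made-up graph: D has no out-links, so it is a dangling node
graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A', 'D'], 'D': []}
ranks = dict.fromkeys(graph, 1.0 / len(graph))
for _ in range(10):
    ranks = pagerank_step(graph, ranks)
total = sum(ranks.values())  # stays at 1 up to floating-point error
```

In MRJob the dangling total would have to be carried from one step to the next, for example through an extra aggregation step; that wiring is not shown here.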


### Question 9.3 

*Run your PageRank implementation on the Wikipedia dataset for 10 iterations,
and display the top 100 ranked nodes (with alpha = 0.85).*

*Run your PageRank implementation on the Wikipedia dataset for 50 iterations,
and display the top 100 ranked nodes (with teleportation factor of 0.15). 
Have the top 100 ranked pages changed? Comment on your findings. Plot the pagerank values for the top 100 pages resulting from the 50 iterations run. Then plot the pagerank values for the same 100 pages that resulted from the 10 iterations run.*

#### Solution: 

Below we run our PageRank for 5 and 10 iterations, respectively. 

##### 5 Iterations

In [41]:
!python driver.py --emr --iterations=5


	Running numNodes...
No handlers could be found for logger "mrjob.conf"
	Complete: numNodes
NUMNODES: 15192277

	Running pageRank...
	Complete: pageRank
ID: 13455888 	 PR: 0.00049
ID: 4695850 	 PR: 0.00021
ID: 1184351 	 PR: 0.0002
ID: 5051368 	 PR: 0.00019
ID: 2437837 	 PR: 0.00015
ID: 13425865 	 PR: 0.00015
ID: 6076759 	 PR: 0.00014
ID: 4196067 	 PR: 0.00014
ID: 7902219 	 PR: 0.00014
ID: 6113490 	 PR: 0.00013
ID: 1384888 	 PR: 0.00013
ID: 6172466 	 PR: 0.00013
ID: 14112583 	 PR: 0.00013
ID: 10390714 	 PR: 0.00012
ID: 6416278 	 PR: 0.00011
ID: 6237129 	 PR: 0.00011
ID: 1516699 	 PR: 0.00011
ID: 12836211 	 PR: 0.00011
ID: 3191491 	 PR: 0.0001
ID: 15164193 	 PR: 0.0001
ID: 7990491 	 PR: 0.0001
ID: 7835160 	 PR: 0.0001
ID: 10469541 	 PR: 0.0001
ID: 13725487 	 PR: 0.0001
ID: 5154210 	 PR: 0.0001
ID: 9276255 	 PR: 9e-05
ID: 9386580 	 PR: 9e-05
ID: 4198751 	 PR: 9e-05
ID: 2797855 	 PR: 9e-05
ID: 7576704 	 PR: 9e-05
ID: 11253108 	 PR: 9e-05
ID: 3603527 	 PR: 8e-05
ID: 3069099 	 PR: 8e-05
ID:

##### 10 Iterations

In [42]:
!python driver.py --emr --iterations=10


	Running numNodes...
No handlers could be found for logger "mrjob.conf"
	Complete: numNodes
NUMNODES: 15192277

	Running pageRank...
	Complete: pageRank
ID: 13455888 	 PR: 0.00042
ID: 1184351 	 PR: 0.00019
ID: 4695850 	 PR: 0.00018
ID: 5051368 	 PR: 0.00016
ID: 6113490 	 PR: 0.00013
ID: 2437837 	 PR: 0.00013
ID: 13425865 	 PR: 0.00013
ID: 1384888 	 PR: 0.00013
ID: 7902219 	 PR: 0.00013
ID: 6076759 	 PR: 0.00012
ID: 4196067 	 PR: 0.00012
ID: 6172466 	 PR: 0.00011
ID: 14112583 	 PR: 0.00011
ID: 3191491 	 PR: 0.0001
ID: 15164193 	 PR: 0.0001
ID: 10390714 	 PR: 0.0001
ID: 6416278 	 PR: 9e-05
ID: 6237129 	 PR: 9e-05
ID: 9276255 	 PR: 9e-05
ID: 7835160 	 PR: 9e-05
ID: 1516699 	 PR: 9e-05
ID: 10469541 	 PR: 9e-05
ID: 13725487 	 PR: 9e-05
ID: 7576704 	 PR: 9e-05
ID: 7990491 	 PR: 8e-05
ID: 4198751 	 PR: 8e-05
ID: 2797855 	 PR: 8e-05
ID: 12836211 	 PR: 8e-05
ID: 5154210 	 PR: 8e-05
ID: 3603527 	 PR: 7e-05
ID: 3069099 	 PR: 7e-05
ID: 9386580 	 PR: 7e-05
ID: 14503460 	 PR: 7e-05
ID: 14881689 	 P
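
To judge whether the ranking changes between runs, the two printed lists can be compared directly. The sketch below uses only the first five IDs from each run above; `compare_rankings` is an illustrative helper.

```python
def compare_rankings(first, second):
    """Report which IDs are shared, dropped, or new between two ranked lists."""
    first_set, second_set = set(first), set(second)
    return {
        'shared': [node for node in first if node in second_set],
        'dropped': [node for node in first if node not in second_set],
        'entered': [node for node in second if node not in first_set],
    }


# Top five IDs after 5 and 10 iterations, copied from the output above
top5_iter5 = ['13455888', '4695850', '1184351', '5051368', '2437837']
top5_iter10 = ['13455888', '1184351', '4695850', '5051368', '6113490']

diff = compare_rankings(top5_iter5, top5_iter10)
```

On this small sample, four of the five leading pages are shared, with 1184351 and 4695850 swapping places between the two runs.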

### Question 9.4 

*Modify your PageRank implementation to produce a topic specific PageRank implementation,
as described in:*

http://www-cs-students.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf

*Note in this article that there is a special caveat to ensure that the transition matrix is irreducible. This caveat lies in footnote 3 on page 3:*

	A minor caveat: to ensure that M is irreducible when p
	contains any 0 entries, nodes not reachable from nonzero
	nodes in p should be removed. In practice this is not problematic.

*and must be adhered to for convergence to be guaranteed. Run topic specific PageRank on the following randomly generated network of 100 nodes:*

s3://ucb-mids-mls-networks/randNet.txt (also available on Dropbox)

*which are organized into ten topics, as described in the file:*

s3://ucb-mids-mls-networks/randNet_topics.txt  (also available on Dropbox)

*Since there are 10 topics, your result should be 11 PageRank vectors (one for the vanilla PageRank implementation in 9.1, and one for each topic with the topic specific implementation). Print out the top ten ranking nodes and their topics for each of the 11 versions, and comment on your result. Assume a teleportation factor of 0.15 in all your analyses.*

#### Solution: 

To accomplish this, we modify our driver and PageRank job. The key changes are new options that select the topic file and the current topic. Most of the changes occur in the `pageRank` class, with minor changes in the driver.

Both modified files are shown below.

##### Driver

Here we add a `--topicFile` argument naming the file that assigns a topic to each node, and loop over the ten topics, passing each one to the PageRank job as `--currentTopic`.

In [2]:
%%writefile topic_driver.py
from __future__ import division

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.emr import EMRJobRunner

from topic_sensitive_pr import pageRank
from number_of_nodes import numNodes

import cPickle as pickle
from collections import defaultdict
from operator import itemgetter
import argparse 

# Storage files 
s3Bucket = 's3://ucb-mids-mls-networks/wikipedia/all-pages-indexed-out.txt'


def getName(obj, namespace):
	return [name for name in namespace if namespace[name] is obj]


def extractValues(job, runner):
	output = defaultdict(int)
	for line in runner.stream_output(): 
		key, value = job.parse_output_line(line)
		output[key] = value

	return output 


def dumpToFile(variable, filename):
	with open(filename, 'w') as f: 
		pickle.dump(variable, f)


def runJob(method, args, emr=False):
	job = method(args=args)

	methodName = getName(method, globals())[0]
	print '\n\t' + 'Running ' + methodName + '...'

	with job.make_runner() as runner: 

		runner.run()
		result = extractValues(job, runner)

		print '\t' + 'Complete: ' + methodName	

		return result 


if __name__ == '__main__':

	# Arguments 
	parser = argparse.ArgumentParser(description='Driver for PageRank in MRJob.')

	parser.add_argument('--emr', default=None, action='store_true',
							help='Flag for using the EMR (and S3 bucket).')

	parser.add_argument('--iterations', default=10, 
							help='Number of iterations to use for PageRank.')

	parser.add_argument('--file', default='PageRank-test.txt', 
							help='File to be passed to the PageRank class.')

	parser.add_argument('--topicFile', default='randNet_topics.txt', 
							help='File containing topic-labels for each node.')

	args = parser.parse_args()

	# Pass 
	if args.emr: 
		argInput = [s3Bucket, '-r', 'emr']

	else: 
		argInput = [args.file]


	# Get node counts 
	totalNodeTuple = runJob(numNodes, args=argInput)
	totalNodes = totalNodeTuple.values()[0]
	
	print 'NUMNODES: %s' % (totalNodes)

	# Execute 

	for i in range(10):

		topNodes = runJob(pageRank, args=argInput + ['--numberOfNodes=%s' % (totalNodes)] + \
						
						['--iterations=%s' % (args.iterations)] + \
						
						['--topicFile=%s' % (args.topicFile)] + \

						['--currentTopic=%s' % (i + 1)] \
						
						)

		nodeTuples = [(key, round(value, 5)) for (key, value) in topNodes.iteritems()]
		sortedNodes = sorted(nodeTuples, key=itemgetter(1), reverse=True)

		# Emit 

		print 'TOPIC:%s' % (i + 1)

		for k, v in sortedNodes:
			print 'ID: %s \t PR: %s' % (k, v)

Writing topic_driver.py


##### Topic-Sensitive PageRank

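The main addition here is a conditional teleportation weight in `reducer_iterate_pr` that distinguishes nodes in the current topic from all other nodes. A node in the current topic receives `(1 - alpha) / T` and every other node receives `alpha / (N - T)`, where `T` is the number of nodes in the current topic and `N` is the total number of nodes. The weight calculation follows the one described for Question 9.4.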

In [1]:
%%writefile topic_sensitive_pr.py
from __future__ import division

from mrjob.job import MRJob
from mrjob.step import MRStep

from collections import defaultdict
from operator import itemgetter

class pageRank(MRJob):

	""" This class implements the page-rank calculation. """

	def configure_options(self):
		
		""" Load options for the class. """
		
		super(pageRank, self).configure_options()

		self.add_passthrough_option('--alpha',
			default=0.85, type=float, help='alpha: Damping factor for teleportation in PageRank')

		self.add_passthrough_option('--iterations',
			default=10, type=int, help='iterations: number of iterations for PageRank')

		self.add_passthrough_option('--manualPower', 
			default=7, type=int, help='manualPower: order of magnitude for number of nodes.')

		self.add_passthrough_option('--numberOfNodes', 
			default=None, type=int, help='numberOfNodes: The number of nodes in your graph. Used for teleportation.')

		self.add_file_option('--topicFile', 
			default=None, type=str, help='topicFile: File containing the topic information for each node.')

		self.add_passthrough_option('--currentTopic', 
			default=None, type=str, help='currentTopic: The current topic for the given PageRank iteration.')

		self.add_passthrough_option('--emittedNumber', 
			default=10, type=int, help='emittedNumber: Top N nodes to emit sorted from the PageRank job.')


	def load_options(self, args):

		""" Initializes the arguments for each class. """

		super(pageRank, self).load_options(args)
		self.alpha = self.options.alpha
		self.iterations = self.options.iterations
		
		# Check number of nodes 
		if self.options.numberOfNodes:
			self.numberOfNodes = self.options.numberOfNodes
		else:
			self.numberOfNodes = pow(10, self.options.manualPower)
		
		# Check topic file 
		if self.options.topicFile: 
			self.topicFile = self.options.topicFile
		else: 
			self.option_parser.error('Please supply a topic file containing node labels.')

		# Check current topic 
		if self.options.currentTopic:
			self.currentTopic = self.options.currentTopic

		else: 
			self.option_parser.error('Please supply a current topic of focus for PageRank.')

		
		# Load topic file
		self.topicListing = defaultdict(str)
		with open(self.topicFile, 'r') as f: 
			for line in f.readlines():

				# Insert 
				node, topic = line.split()
				self.topicListing[node] = topic  

		# Topic sizes 
		self.topicAmounts = defaultdict(int)
		for topic in self.topicListing.values(): 
			self.topicAmounts[topic] += 1


		# Misc 
		self.emittedNumber = self.options.emittedNumber


	def mapper_init_pr(self, _, line):

		""" This initializes the PageRank algorithm by assembling the node list 
		for the initial PageRank values. """

		# Parse 
		line = line.split('\t')
		node = line[0]
		adjacencyList = eval(line[1])

		# Track 
		for neighbor in adjacencyList.keys(): 

			# Emit raw nodes
			yield neighbor, None


		# Pass values
		yield node, adjacencyList


	def reducer_init_pr(self, node, initTuple):

		""" This attaches initial PageRanks for the algorithm. """

		adjacencyList = dict()

		# Re-discover 
		for element in initTuple:
			if isinstance(element, dict):
				adjacencyList = element 

		# Initialize PR
		PageRank = float(1) / float(self.numberOfNodes)

		# Emit
		yield node, (adjacencyList, PageRank)


	def mapper_iterate_pr(self, node, nodeTuple):

		""" This projects all of the PageRank weights for each node's neighbor. """

		adjacencyList, PageRank = nodeTuple

		if not adjacencyList:
			pass

		else: 

			# Emit PR 
			for neighbor in adjacencyList.keys(): 
				yield neighbor, PageRank / len(adjacencyList)

		# Emit structure 
		yield node, adjacencyList


	def reducer_iterate_pr(self, node, PRNodeObject):

		""" This reconstructs the graph structure from the updated PageRanks. """

		updatedPR = 0
		# Default in case no structure record arrives for this node
		adjacencyList = dict()

		# Combine PR 
		for value in PRNodeObject:
			if isinstance(value, dict):
				adjacencyList = value 

			else: 
				updatedPR += value 

		# Custom weights 
		nodeTopic = self.topicListing[node]
		currentTopicQuantity = self.topicAmounts[self.currentTopic]
		
		if nodeTopic == self.currentTopic: 
			weight = ((1 - self.alpha) / currentTopicQuantity)

		else: 
			weight = self.alpha / (self.numberOfNodes - currentTopicQuantity)


		# Update 
		updatedPR = (1 - self.alpha) * weight + self.alpha * updatedPR

		# Emit 
		yield node, (adjacencyList, updatedPR)


	def mapper_sort(self, node, nodeTuple):

		""" Emits the page rank for each node. """

		adjacencyList, PageRank = nodeTuple

		yield None, (node, PageRank)


	def reducer_sort(self, _, PageRankPair):

		""" Keeps the top N PageRank values. """

		sortedList = []

		# Iterate and remove 
		for node, score in PageRankPair:

			sortedList.append((node, score))
			sortedList = sorted(sortedList, key=itemgetter(1), reverse=True)

			if len(sortedList) > self.emittedNumber: 
				sortedList.pop()

		# Emit 
		for node, score in sortedList: 
			yield node, score


	def steps(self):

		""" Determines the steps for the job. Has two phases- initiate PR and iterate. """

		initializeStep = [

			MRStep(mapper=self.mapper_init_pr, 
					reducer=self.reducer_init_pr)

		]

		iterateStep = [

			MRStep(mapper=self.mapper_iterate_pr, 
					reducer=self.reducer_iterate_pr)         

		]

		sortStep = [

			MRStep(mapper=self.mapper_sort, 
					reducer=self.reducer_sort)

		]

		return initializeStep + iterateStep * self.iterations + sortStep
 
 
if __name__ == '__main__':
			pageRank().run()      

Writing topic_sensitive_pr.py
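
The conditional weight in `reducer_iterate_pr` can be checked in isolation. The sketch below restates it as a plain function (`teleport_weight` is an illustrative name) and, assuming 100 nodes of which 10 belong to the current topic, confirms that the weights form a probability distribution.

```python
def teleport_weight(node_topic, current_topic, topic_size, n_nodes, alpha=0.85):
    """Teleportation weight for one node, mirroring reducer_iterate_pr."""
    if node_topic == current_topic:
        return (1 - alpha) / topic_size
    return alpha / (n_nodes - topic_size)


# Assumed layout: 100 nodes, 10 of which carry the current topic '1'
topics = ['1'] * 10 + ['2'] * 90
weights = [teleport_weight(t, '1', topic_size=10, n_nodes=100) for t in topics]
total = sum(weights)  # in-topic and off-topic weights together sum to 1
```

With these numbers an in-topic node gets 0.015 and an off-topic node about 0.0094, so the bias toward the current topic is present but mild.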


##### Execution 

Here we run our topic-sensitive PageRank with additional arguments passed to the driver for topic loading and management. 

In [3]:
!python topic_driver.py --file=randNet.txt --topicFile=randNet_topics.txt


	Running numNodes...
	Complete: numNodes
NUMNODES: 100

	Running pageRank...
	Complete: pageRank
TOPIC:1
ID: 15 	 PR: 0.01642
ID: 74 	 PR: 0.01597
ID: 63 	 PR: 0.01586
ID: 100 	 PR: 0.01543
ID: 85 	 PR: 0.01511
ID: 9 	 PR: 0.01503
ID: 58 	 PR: 0.01483
ID: 71 	 PR: 0.01449
ID: 61 	 PR: 0.01446
ID: 52 	 PR: 0.01418

	Running pageRank...
	Complete: pageRank
TOPIC:2
ID: 15 	 PR: 0.01615
ID: 9 	 PR: 0.01613
ID: 58 	 PR: 0.01606
ID: 74 	 PR: 0.01585
ID: 63 	 PR: 0.01569
ID: 71 	 PR: 0.01566
ID: 100 	 PR: 0.01533
ID: 85 	 PR: 0.01495
ID: 52 	 PR: 0.01447
ID: 61 	 PR: 0.01416

	Running pageRank...
	Complete: pageRank
TOPIC:3
ID: 15 	 PR: 0.01737
ID: 74 	 PR: 0.01596
ID: 63 	 PR: 0.01563
ID: 100 	 PR: 0.01534
ID: 9 	 PR: 0.01508
ID: 85 	 PR: 0.01507
ID: 58 	 PR: 0.01472
ID: 61 	 PR: 0.01432
ID: 71 	 PR: 0.01431
ID: 52 	 PR: 0.01429

	Running pageRank...
	Complete: pageRank
TOPIC:4
ID: 15 	 PR: 0.01637
ID: 63 	 PR: 0.01601
ID: 74 	 PR: 0.0159
ID: 100 	 PR: 0.01533
ID: 85 	 PR: 0.01516
ID: 9 	 P

No handlers could be found for logger "mrjob.runner"


##### Vanilla Implementation 

Below is the run of vanilla PageRank on `randNet.txt`, using the driver from Question 9.1.

In [4]:
!python driver.py --file=randNet.txt


	Running numNodes...
	Complete: numNodes
NUMNODES: 100

	Running pageRank...
	Complete: pageRank
ID: 15 	 PR: 0.01636
ID: 74 	 PR: 0.01597
ID: 63 	 PR: 0.01577
ID: 100 	 PR: 0.01538
ID: 85 	 PR: 0.01518
ID: 9 	 PR: 0.01503
ID: 58 	 PR: 0.01483
ID: 71 	 PR: 0.01449
ID: 61 	 PR: 0.01441
ID: 52 	 PR: 0.01431
ID: 77 	 PR: 0.01366
ID: 92 	 PR: 0.01365
ID: 32 	 PR: 0.01331
ID: 13 	 PR: 0.01318
ID: 88 	 PR: 0.01314
ID: 17 	 PR: 0.01307
ID: 70 	 PR: 0.01307
ID: 25 	 PR: 0.01296
ID: 90 	 PR: 0.01286
ID: 49 	 PR: 0.01255
ID: 53 	 PR: 0.01221
ID: 39 	 PR: 0.01208
ID: 51 	 PR: 0.01179
ID: 73 	 PR: 0.01164
ID: 45 	 PR: 0.0116
ID: 99 	 PR: 0.01154
ID: 28 	 PR: 0.01151
ID: 35 	 PR: 0.0115
ID: 56 	 PR: 0.0114
ID: 55 	 PR: 0.01113
ID: 27 	 PR: 0.01112
ID: 10 	 PR: 0.01112
ID: 94 	 PR: 0.01111
ID: 41 	 PR: 0.01109
ID: 95 	 PR: 0.01108
ID: 91 	 PR: 0.01103
ID: 65 	 PR: 0.01085
ID: 86 	 PR: 0.0107
ID: 84 	 PR: 0.01059
ID: 62 	 PR: 0.01056
ID: 46 	 PR: 0.01054
ID: 2 	 PR: 0.01033
ID: 78 	 PR: 0.01027
ID: 

No handlers could be found for logger "mrjob.runner"
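
Regarding the footnote quoted in the question: in the implementation above every node receives a nonzero teleportation weight, so the vector p has no zero entries and no nodes need to be removed. If off-topic weights were set to zero instead, the nodes to keep could be found with a breadth-first search from the topic's nodes. A minimal sketch on a made-up graph, with `reachable_from` as a hypothetical helper:

```python
from collections import deque


def reachable_from(graph, sources):
    """Return the set of nodes reachable from any source by following out-links."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Made-up graph: 4 and 5 link toward node 1 but cannot be reached from it
graph = {'1': ['2'], '2': ['3'], '3': ['1'], '4': ['1'], '5': ['4']}
keep = reachable_from(graph, ['1'])
removed = set(graph) - keep
```

Nodes in `removed` would be dropped before iterating, and the node count used for normalization would shrink accordingly.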


### Question 9.5 

*Here you will apply your topic-specific PageRank implementation to Wikipedia,
defining topics (very arbitrarily) for each page by the length (number of characters) of the name of the article mod 10, so that there are 10 topics. Once again, print out the top ten ranking nodes and their topics for each of the 11 versions, and comment on your result. Assume a teleportation factor of 0.15 in all your analyses.*

#### Solution: 

To achieve this, we modify our topic-sensitive PageRank to take the length of the node identifier modulo 10 as the topic. 
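
The topic rule described above reduces to a single expression. The sketch assumes the identifier is available as a string, and `topic_for` is an illustrative name. Note that it yields topics 0 through 9, whereas the driver in Question 9.4 iterates over topics 1 through 10, so the loop bounds would need to match.

```python
def topic_for(identifier, n_topics=10):
    """Assign a topic from the character length of an identifier."""
    return len(str(identifier)) % n_topics


# IDs taken from the Wikipedia output above
first = topic_for('13455888')   # 8 characters, topic 8
second = topic_for('4695850')   # 7 characters, topic 7
```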

##### Modified Topic-Sensitive PageRank