In [2]:
#Reload changes -> always run this
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


__==Undirected toy network dataset==__

In an undirected network all links are symmetric, 
i.e., for a pair of nodes 'A' and 'B,' both of the links:

A -> B and B -> A

will exist. 

The toy data are available in a sparse (stripes) representation:

(node) \t (dictionary of links)

on AWS/Dropbox via the url:

s3://ucb-mids-mls-networks/undirected_toy.txt
On under the Data Subfolder for HW7 on Dropbox with the same file name. 
The Data folder is in: https://db.tt/Kxu48mL1)

In the dictionary, target nodes are keys, link weights are values 
(here, all weights are 1, i.e., the network is unweighted).


__=Directed toy network dataset==__

In a directed network all links are not necessarily symmetric, 
i.e., for a pair of nodes 'A' and 'B,' it is possible for only one of:

A -> B or B -> A

to exist. 

These toy data are available in a sparse (stripes) representation:

(node) \t (dictionary of links)

on AWS/Dropbox via the url:

s3://ucb-mids-mls-networks/directed_toy.txt
On under the Data Subfolder for HW7 on Dropbox with the same file name

In the dictionary, target nodes are keys, link weights are values 
(here, all weights are 1, i.e., the network is unweighted).


## 7.0

===HW 7.0: Shortest path graph distances (toy networks)===

In this part of your assignment you will develop the base of your code for the week.

Write MRJob classes to find shortest path graph distances, 
as described in the lectures. In addition to finding the distances, 
your code should also output a distance-minimizing path between the source and target.
Work locally for this part of the assignment, and use 
both of the undirected and directed toy networks.

To proof you code's function, run the following jobs

- shortest path in the undirected network from node 1 to node 4
Solution: 1,5,4 

- shortest path in the directed network from node 1 to node 5
Solution: 1,2,4,5

and report your output---make sure it is correct!


In [2]:
%%writefile init_data.py

from sys import maxint
from mrjob.job import MRJob
from mrjob.step import MRStep

class initGraphJob(MRJob):
    
    def configure_options(self):
        super(initGraphJob, self).configure_options()
        self.add_passthrough_option('--startNode', default = '1')
    
    def mapper(self, _, node):
        nodeID, links = node.split('\t') #split on input tab
        links = eval(links) #make a dictionary
        
        if nodeID == self.options.startNode: 
            yield nodeID, (links.keys(), 0, 'Q', [nodeID]) #sets up start node
        else:
            yield nodeID, (links.keys(), maxint, 'U', [])
            
    def steps(self):
        return [MRStep(mapper = self.mapper)]

if __name__ == "__main__":
    initGraphJob.run()

Overwriting init_data.py


In [3]:
#Runs job and outputs results to newfile, requires changing parameters and probably want to change
from init_data import initGraphJob

mr_job = initGraphJob(args = ['directed_toy.txt'])

with open('newgraph.txt', 'w+') as myfile:
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            myfile.write(line)



In [4]:
%%writefile shortestPathJob.py

from mrjob.job import MRJob
from mrjob.step import MRStep
import sys

class ShortestPathJob(MRJob):
    
    def mapper(self, _, line):
        newline = line.strip().split('\t')
        
        node = eval(newline[0])
        
        data = eval(newline[1])
        neighbors = (data[0])
        distance = int(data[1])
        label = data[2]
        path = data[3]
        
        if label == 'Q':
            for neighbor in neighbors:
                newPath = list(path)
                newPath.append(neighbor)
                yield neighbor, [None, distance + 1, 'Q', newPath]
            yield node, [neighbors, distance, 'V', path]
        else:
            yield node, [neighbors, distance, label, path]
     

    
    def reducer(self, key, values):
        #By default assume a node is unvisited with an empty list of neighbors, makes updating below easier
        neighbors = [] 
        distance = sys.maxint
        label = 'U'
        path = []
        
        for value in values:
            
            temp_neighbors = value[0]
            temp_distance = value[1]
            temp_label = value[2]
            temp_path = value[3]
            
            if temp_label == 'V':
                neighbors = temp_neighbors
                distance = temp_distance
                label = temp_label
                path = temp_path
                break
            
            elif temp_label == 'Q':
                label = temp_label
                distance = temp_distance
                path = temp_path
                
            elif temp_label == 'U':
                neighbors = temp_neighbors
                
        yield key, [neighbors, distance, label, path]
        
        
            
if __name__ == '__main__':
    ShortestPathJob.run()

Overwriting shortestPathJob.py


In [6]:
from shortestPathJob import ShortestPathJob
mr_job = ShortestPathJob(args = ['newgraph.txt'])

with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        print mr_job.parse_output_line(line)



('1', [['2', '6'], 0, 'V', [1]])
('2', [['1', '3', '4'], 1, 'V', [1, '2']])
('3', [['2', '4'], 2, 'Q', [1, '2', '3']])
('4', [['2', '5'], 2, 'Q', [1, '2', '4']])
('5', [['1', '2', '4'], 9223372036854775807, 'U', []])
('6', [[], 1, 'V', [1, '6']])


Example of first mapper for reference
('2', [None, 1, 'Q', [1, '2']])
('6', [None, 1, 'Q', [1, '6']])
('1', [['2', '6'], 0, 'V', [1]])
('2', [['1', '3', '4'], 9223372036854775807, 'U', []])
('3', [['2', '4'], 9223372036854775807, 'U', []])
('4', [['2', '5'], 9223372036854775807, 'U', []])
('5', [['1', '2', '4'], 9223372036854775807, 'U', []])

In [5]:
#Driver for iterations
import os
from init_data import initGraphJob
from shortestPathJob import ShortestPathJob

def findShortestPath(filename, startNode, endNode):
    
    mr_job_init = initGraphJob(args = [filename, '--startNode', startNode]) #insert text file here to change 

    with open('working-graph.txt', 'w+') as myfile:
        with mr_job_init.make_runner() as runner:
            runner.run()
            for line in runner.stream_output():
                myfile.write(line)


    while True:
        with open('newFile.txt', 'w+') as myfile:
            mr_job = ShortestPathJob(args = ['working-graph.txt'])
            with mr_job.make_runner() as runner:
                runner.run()
                for line in runner.stream_output():
                    output = mr_job.parse_output_line(line)
                    myfile.write(line)
                    if output[0] == endNode and output[1][2] == "V":
                        return (output[1][3], output[1][1]) #path
                        break
            
        os.rename('newFile.txt', 'working-graph.txt')

print 'Shortest path in undirected Graph:'
results = findShortestPath('undirected_toy.txt', '1', '4')
print 'Path: ' + str(results[0]) + ', with distance ' + str(results[1])

print 'Shortest path in directed Graph:'
results = findShortestPath('directed_toy.txt', '1', '5')
print 'Path: ' + str(results[0]) + ', with distance ' + str(results[1])





Shortest path in undirected Graph:
Path: ['1', '5', '4'], with distance 2




Shortest path in directed Graph:
Path: ['1', '2', '4', '5'], with distance 3


For the __directed graph__ from the output we get: 

[1, "2", "4", "5"] as the path from 1 to 5

For the __undirected graph__ from the output we get:

[1, "5", "4"] as the graph from 1 to 4

## ==Main dataset 1: NLTK synonyms==

In the next part of this assignment you will explore a network derived from
the NLTK synonym database used for evaluation in HW 5. At a high level, this
network is undirected, defined so that there exists link between two nodes/words 
if the pair or words are a synonym. These data may be found at the location:

s3://ucb-mids-mls-networks/synNet/synNet.txt
s3://ucb-mids-mls-networks/synNet/indices.txt
On under the Data Subfolder for HW7 on Dropbox with the same file names

where synNet.txt contains a sparse representation of the network:

(index) \t (dictionary of links)

in indexed form, and indices.txt contains a lookup list

(word) \t (index)

of indices and words. This network is small enough for you to explore and run
scripts locally, but will also be good for a systems test (for later) on AWS.

In the dictionary, target nodes are keys, link weights are values 
(here, all weights are 1, i.e., the network is unweighted).


## ===HW 7.1: Exploratory data analysis (NLTK synonyms)===

Using MRJob, explore the synonyms network data.
Consider plotting the degree distribution (does it follow a power law?),
and determine some of the key features, like:

number of nodes, 

number links,

or the average degree (i.e., the average number of links per node),
etc...

As you develop your code, please be sure to run it locally first (though on the whole dataset). 
Once you have gotten you code to run locally, deploy it on AWS as a systems test
in preparation for our next dataset (which will require AWS).


In [61]:
%%writefile numberNodesMR.py

from mrjob.job import MRJob
from mrjob.step import MRStep

class NumberNodes(MRJob):
    
    def mapper1(self, _, line):
        newLine = line.split('\t')
        
        node = newLine[0]
        neighbors = eval(newLine[1])
        yield node, 1
        for neighbor in neighbors.keys():
            yield neighbor, 1
    
    def reducer1(self, key, values):
        yield key, 1
    
    def mapper2(self, key, values):
        yield None, 1
    
    def reducer2(self, key, values):
        total = sum(values)
        yield None, total
    
        
    def steps(self):
        return [MRStep(mapper = self.mapper1, reducer = self.reducer1),
               MRStep(mapper = self.mapper2, reducer = self.reducer2)]
    
if __name__ == "__main__":
    NumberNodes.run()

Overwriting numberNodesMR.py


In [63]:
from numberNodesMR import NumberNodes

filename = 'undirected_toy.txt'

mr_job = NumberNodes(args = [filename])

with mr_job.make_runner() as runner:
    runner.run()
    print "Number of nodes in " + filename
    for line in runner.stream_output():
        print mr_job.parse_output_line(line)



Number of nodes in undirected_toy.txt
(None, 5)


In [3]:
%%writefile numberLinks.py

from mrjob.job import MRJob
from mrjob.step import MRStep

class NumberLinksJob(MRJob):
    
    def mapper(self, _, line):
        newLine = line.strip().split('\t')
        
        node = newLine[0]
        neighbors = eval(newLine[1])
        if len(neighbors) != 0:
            for neighbor in neighbors.keys():
                yield None, 1
    
    def reducer(self, key, values):
        total = sum(values)
        yield None, total

if __name__ == "__main__":
    NumberLinksJob.run()

Writing numberLinks.py


In [5]:
from numberLinks import NumberLinksJob

filename = 'directed_toy.txt'

mr_job = NumberLinksJob(args = [filename])

with mr_job.make_runner() as runner:
    runner.run()
    print "Number of links in " + filename
    for line in runner.stream_output():
        print mr_job.parse_output_line(line)



Number of links in directed_toy.txt
(None, 12)


In [2]:
## For average links unsure about how to handle directed graph

In [7]:
from numberNodesMR import NumberNodes
from numberLinks import NumberLinksJob

filenames = ['synNet/synNet.txt']

for filename in filenames:
    mr_job = NumberNodes(args = [filename])
    
    print "Number of Nodes in " + filename
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            print mr_job.parse_output_line(line)
    
    mr_job2 = NumberLinksJob(args = [filename])
    
    
    print "Number of Links in " + filename
    with mr_job2.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            print mr_job2.parse_output_line(line)





Number of Nodes in synNet/synNet.txt
(None, 8271)




Number of Links in synNet/synNet.txt
(None, 61134)


In [None]:
#To run on AWS use the following command:
#change the .py file and input and output directories to match the job
# python numberLinks.py -r emr \
# s3://hw7nltk/synNet.txt \ 
# --conf-path mrjob.conf \
# --output-dir=s3://dunmireg/HW7/numberLinksNLTK \ 
# --no-output \
# --no-strict-protocol

Confirmed numbers run on EMR match run locally

## ===HW 7.2: Shortest path graph distances (NLTK synonyms)===

Write (reuse your code from 7.0) an MRJob class to find shortest path graph distances, 
and apply it to the NLTK synonyms network dataset. 

Proof your code's function by running the job:

- shortest path starting at "walk" (index=7827) and ending at "make" (index=536),

and showing you code's output. Once again, your output should include the path and the distance.

As you develop your code, please be sure to run it locally first (though on the whole dataset). 
Once you have gotten you code to run locally, deploy it on AWS as a systems test
in preparation for our next dataset (which will require AWS).

In [6]:
results = findShortestPath('synNet/synNet.txt', '7827', '536')
print 'Path: ' + str(results[0]) + ', with distance ' + str(results[1])



Path: ['7827', '4655', '631', '536'], with distance 3


In [3]:
from init_data import initGraphJob
from shortestPathJob import ShortestPathJob

def findShortestPath2(filename, startNode, endNode, clusterID):
    
    counter = 0
    
    mr_job_init = initGraphJob(args = [filename, '--startNode', startNode,
                                      '--no-strict-protocols', '-r', 'emr', 
                                      '--emr-job-flow-id', clusterID,
                                      '--output-dir', 's3://dunmireg/HW7/output' + str(counter),
                                      '--no-output']) 
  

    with mr_job_init.make_runner() as runner:
        runner.run()


    iterate = True
    while iterate:
        counter += 1
        mr_job = ShortestPathJob(args = ['s3://dunmireg/HW7/output' + str(counter - 1) + '/', 
                                        '--no-strict-protocols', '-r', 'emr',
                                        '--emr-job-flow-id', clusterID,
                                        '--output-dir', 's3://dunmireg/HW7/output' + str(counter),
                                        '--no-output'])
        with mr_job.make_runner() as runner:
            runner.run()
            for line in runner.stream_output():
                output = mr_job.parse_output_line(line)
                if output[0] == endNode and output[1][2] == "V":
                    print "The path is: " + str(output[1][3])
                    print "In a distance of: " + str(output[1][1]) #path
                    iterate = False
                    break


In [4]:
!python -m mrjob.tools.emr.create_job_flow '--conf-path' 'mrjob.conf' #need to add configuration file here

creating new scratch bucket mrjob-d58d9fd7143d64f6
using s3://mrjob-d58d9fd7143d64f6/tmp/ as our scratch dir on S3
Creating persistent job flow to run several jobs in...
creating tmp directory /var/folders/tx/cr7tg62d7rdd750f_czjczfm0000gn/T/no_script.dunmireg.20160309.131415.076030
writing master bootstrap script to /var/folders/tx/cr7tg62d7rdd750f_czjczfm0000gn/T/no_script.dunmireg.20160309.131415.076030/b.py
creating S3 bucket 'mrjob-d58d9fd7143d64f6' to use as scratch space
Copying non-input files into s3://mrjob-d58d9fd7143d64f6/tmp/no_script.dunmireg.20160309.131415.076030/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Can't access IAM API, trying default instance profile: EMR_EC2_DefaultRole
Can't access IAM API, trying default service role: EMR_DefaultRole
Job flow created with ID: j-15ZNE1DJBYI7U
j-15ZNE1DJBYI7U


In [5]:
#infinite loop problem
findShortestPath2('s3://dunmireg/Input/directed_toy.txt', '1', '5', 'j-15ZNE1DJBYI7U')
# results = findShortestPath('synNet/synNet.txt', '7827', '536')
# print 'Path: ' + str(results[0]) + ', with distance ' + str(results[1])



The path is: ['1', '2', '4', '5']
In a distance of: 3


# _Still to complete_

## ==Main dataset 2: English Wikipedia==

For the remainder of this assignment you will explore the English Wikipedia hyperlink network.
The dataset is built from the Sept. 2015 XML snapshot of English Wikipedia.
For this directed network, a link between articles: 

A -> B

is defined by the existence of a hyperlink in A pointing to B.
This network also exists in the indexed format:

Data: s3://ucb-mids-mls-networks/wikipedia/all-pages-indexed-out.txt
Data: s3://ucb-mids-mls-networks/wikipedia/all-pages-indexed-in.txt
Data: s3://ucb-mids-mls-networks/wikipedia/indices.txt
On under the Data Subfolder for HW7 on Dropbox with the same file names

but has an index with more detailed data:

(article name) \t (index) \t (in degree) \t (out degree)

In the dictionary, target nodes are keys, link weights are values .
Here, a weight indicates the number of time a page links to another.
However, for the sake of this assignment, treat this an unweighted network,
and set all weights to 1 upon data input.


## ===HW 7.3: Exploratory data analysis (Wikipedia)===

Using MRJob, explore the Wikipedia network data on the AWS cloud. Reuse your code from HW 7.1---does is scale well? 
Be cautioned that Wikipedia is a directed network, where links are not symmetric. 
So, even though a node may be linked to, it will not appear as a primary record itself if it has no out-links. 
This means that you may have to ADJUST your code (depending on its design). 
To be sure of your code's functionality in this context, run a systems test on the directed_toy.txt network.


In [None]:
#To run on AWS use the following command:
#change the .py file and input and output directories to match the job
# python numberNodesMR.py -r emr \
# s3://ucb-mids-mls-networks/wikipedia/all-pages-indexed-out.txt \
# --conf-path mrjob.conf \
# --output-dir=s3://dunmireg/HW7/numberNodesWiki \ 
# --no-output \
# --no-strict-protocol

__15,192,277 Nodes__

In [None]:
#To run on AWS use the following command:
#change the .py file and input and output directories to match the job
# python numberLinks.py -r emr \
# s3://ucb-mids-mls-networks/wikipedia/all-pages-indexed-out.txt \
# --conf-path mrjob.conf \
# --output-dir=s3://dunmireg/HW7/numberLinksWiki \ 
# --no-output \
# --no-strict-protocol

__142,114,057 Links__