>### HW 7.0: Shortest path graph distances (toy networks)

>In this part of your assignment you will develop the base of your code for the week.

>Write MRJob classes to find shortest path graph distances, 
as described in the lectures. In addition to finding the distances, 
your code should also output a distance-minimizing path between the source and target.
Work locally for this part of the assignment, and use 
both the undirected and the directed toy networks.

>To prove that your code works, run the following jobs:

>- shortest path in the undirected network from node 1 to node 4  
Solution: 1,5,4 

>- shortest path in the directed network from node 1 to node 5  
Solution: 1,2,4,5

>and report your output---make sure it is correct!
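
A plain single-machine breadth-first search is a convenient reference for checking the MRJob output on small graphs. The sketch below is not part of the assignment code: `bfs_shortest_path` and `example_graph` are hypothetical names, and the graph is a made-up example rather than the contents of the toy files. It assumes the adjacency format that the job code appears to use, a dict mapping each node ID to a dict of neighbor IDs.

```python
from collections import deque

def bfs_shortest_path(graph, source, target):
    """Return (distance, [nodes]) on an unweighted graph, or (None, []) if unreachable."""
    parents = {source: None}          # node -> node it was first reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbor in sorted(graph.get(node, {})):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    if target not in parents:
        return None, []
    # walk the parent pointers back from the target to rebuild the path
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parents[node]
    path.reverse()
    return len(path) - 1, path

# Made-up directed graph for illustration only (NOT the toy network files)
example_graph = {
    '1': {'2': 1, '3': 1},
    '2': {'4': 1},
    '3': {'4': 1},
    '4': {'5': 1},
    '5': {},
}
distance, path = bfs_shortest_path(example_graph, '1', '5')
print('{} {}'.format(distance, ','.join(path)))
```

Loading a real toy file into the same dict shape and comparing the result with the MapReduce output gives a quick correctness check before moving to larger graphs.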

In [36]:
%%writefile shortest_path.py
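import ast

from mrjob.job import MRJob

class ShortestPathBFS(MRJob):
    class Node:
        def __init__(self, nodeid, links='{}', distance=-1, state='U', path='_'):
            # links arrive as a dict literal such as "{'2': 1}"; literal_eval
            # parses it without executing arbitrary code
            self.links = ast.literal_eval(links)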
            self.distance = int(distance)
            self.STATE = state
            self.ID = nodeid
            self.path = path

        def setVisited(self):
            self.STATE = 'V'

        def setQueued(self):
            self.STATE = 'Q'

        def sendQueuedNodes(self):
            for link_id in self.links:
                newpath = self.path+',' if self.path!='*' else ''
                newpath += self.ID+'_'+link_id 
                yield link_id, '\t'.join([ '{}', str(self.distance+1), 'Q', newpath])
        
        def makeNode(self):
            return '\t'.join([str(self.links), str(self.distance), self.STATE, self.path])
    
    def process_node_occurances(self, nodeID, nodeinfo):
        ''' Parse nodes within reducer 
        '''
        links, distance, state, path = nodeinfo.split('\t')
        return self.Node(nodeID, links, distance, state, path)
        
    def mapper(self, _, line):
        ''' Read each node from temp file
            and send node / queued nodes 
            to stream
        '''
        # read line as a node
        nodeID, links, distance, state, path = line.strip().split('\t')
        current_node = self.Node(nodeID, links, distance, state, path)
        
        # expand the frontier: emit a queued record for every neighbor,
        # then mark this node as visited
        if current_node.STATE == 'Q':
            for node_id, node in current_node.sendQueuedNodes():
                yield node_id, node
            current_node.setVisited()
        
        # send current node
        yield current_node.ID, current_node.makeNode()
    
    def reducer(self, nodeID, occurances):
        ''' Join all information for each node 
        '''
        # read each node occurrence
        node_data = [ self.process_node_occurances(nodeID, o) for o in occurances ]
        
        # join all node data together 
        node_links = {}
        node_state = 'U'
        node_path = '_'
        node_distance = -1
        for n in node_data:
            # if new distance, process
            if n.distance>-1:
                # if current distance is already processing, check logic
                if node_distance!=-1:
                    # new distance must be smaller than current 
                    if n.distance<node_distance:
                        node_distance = n.distance
                        node_path = n.path
                
                # otherwise, use new distance 
                else:
                    node_distance = n.distance
                    node_path = n.path
            node_links.update(n.links)
            # state priority is visited > queued > unvisited, independent of
            # the order in which the records arrive
            if n.STATE == 'V' or (n.STATE == 'Q' and node_state == 'U'):
                node_state = n.STATE
        current_node = self.Node(nodeID, str(node_links), str(node_distance), node_state, node_path)
        
        # send node 
        yield current_node.ID, current_node.makeNode()

if __name__=='__main__':
    ShortestPathBFS.run()

Overwriting shortest_path.py
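
The reducer's job is to merge every record emitted for one node into a single node. A minimal standalone sketch of that merge rule follows, without mrjob. `merge_records` is a hypothetical helper, and it assumes the rule is: union the links, keep the smallest non-negative distance with its path, and keep the most advanced state, with visited ('V') ahead of queued ('Q') ahead of unvisited ('U').

```python
def merge_records(records):
    """Combine (links, distance, state, path) tuples emitted for one node."""
    state_rank = {'U': 0, 'Q': 1, 'V': 2}   # assumed priority: V > Q > U
    links, best_distance, best_path, state = {}, -1, '_', 'U'
    for rec_links, distance, rec_state, path in records:
        links.update(rec_links)
        # -1 means "no distance known yet"; otherwise keep the minimum
        if distance > -1 and (best_distance == -1 or distance < best_distance):
            best_distance, best_path = distance, path
        if state_rank[rec_state] > state_rank[state]:
            state = rec_state
    return links, best_distance, state, best_path

# One unvisited node record plus two queued records from different neighbors
records = [
    ({'2': 1, '3': 1}, -1, 'U', '_'),
    ({}, 2, 'Q', '1_4,4_5'),
    ({}, 1, 'Q', '1_5'),
]
print(merge_records(records))
```

Because every step is a union, a minimum, or a maximum, the result does not depend on the order in which the records reach the reducer.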


In [43]:
import shortest_path, os
reload(shortest_path)
from shortest_path import ShortestPathBFS

def find_distance(file_name, node_num):
    with open(file_name,'r') as r:
        for line in r:
            # strip the trailing newline so it does not end up in the path
            node_id, _, distance, _, path = line.strip().split('\t')
            if node_id == node_num:
                return (distance, path)

def find_shortest_path(source_file, temp_file, start_node, end_node):
    with open(temp_file, 'w') as w:
        with open(source_file, 'r') as r:
            for line in r:
                line = line.strip()
                nodeid, links = line.split('\t')
                if nodeid == start_node: 
                    distance = 0
                    state = 'Q'
                    path = '*'
                else: 
                    distance = -1
                    state = 'U'
                    path = '_'
                w.write('\t'.join((nodeid, links, str(distance), state, path))+'\n')

    args = [temp_file, '--strict-protocols']
    mrjob = ShortestPathBFS(args=args)

    i = 0
    queue_empty = False
    while not queue_empty and i < 10:
        i += 1
        with mrjob.make_runner() as runner, open(temp_file+'.running', "w") as f:
            runner.run()

            for line in runner.stream_output():
                # write line to temp file 
                nodeid, node = mrjob.parse_output_line(line)
                f.write('\t'.join((nodeid, node))+'\n')

                # check for last iteration 
                _, distance, _, _ = node.split('\t')
                if nodeid == end_node and distance != '-1':
                    queue_empty = True

        os.remove(temp_file)
        os.rename(temp_file+'.running', temp_file)

    # report how many iterations ran so the caller can print it
    return i

In [40]:
SOURCE_FILE = 'Data/directed_toy.txt'
TEMP_FILE = 'Data/graph_tmp.txt'
START_NODE = '1'
END_NODE = '5'

i = find_shortest_path(SOURCE_FILE, TEMP_FILE, START_NODE, END_NODE)

print 'End at iteration {}'.format(i)
distance, path = find_distance(TEMP_FILE, END_NODE)
print 'Distance: {}\n\tPath: {}'.format(distance, path)

End at iteration 3
Distance: 3
	Path: 1_2,2_4,4_5



> ### HW 7.2: Shortest path graph distances (NLTK synonyms)
Reusing your code from 7.0, write an MRJob class to find shortest path graph distances, and apply it to the NLTK synonyms network dataset.  
Prove that your code works by running the job:
- shortest path starting at "walk" (index=7827) and ending at "make" (index=536), and showing your code's output. Once again, your output should include the path and the distance.

In [45]:
SOURCE_FILE = 'Data/synNet/synNet.txt'
TEMP_FILE = 'Data/synNet_temp.txt'
START_NODE = '7827'
END_NODE = '536'

i = find_shortest_path(SOURCE_FILE, TEMP_FILE, START_NODE, END_NODE)

print 'End at iteration {}'.format(i)
distance, path = find_distance(TEMP_FILE, END_NODE)
print 'Distance: {}\n\tPath: {}'.format(distance, path)

End at iteration 3
Distance: 3
	Path: 7827_1426,1426_1668,1668_536
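
The job reports paths as a comma-separated list of edges (`source_target`), while the assignment states its solutions as node sequences such as `1,2,4,5`. A small conversion helper makes the two easy to compare. `edges_to_nodes` is a hypothetical name, and the sketch assumes the path markers used above: `*` for the start node and `_` for a node with no known path.

```python
def edges_to_nodes(edge_path):
    """Convert an edge list like '1_2,2_4,4_5' into ['1', '2', '4', '5']."""
    # '*' (start node) and '_' (no path yet) carry no edges to convert
    if edge_path in ('*', '_', ''):
        return []
    edges = [edge.split('_') for edge in edge_path.split(',')]
    nodes = [edges[0][0]]              # first edge contributes its source
    for _source, target in edges:
        nodes.append(target)           # every edge contributes its target
    return nodes

print(','.join(edges_to_nodes('1_2,2_4,4_5')))
print(','.join(edges_to_nodes('7827_1426,1426_1668,1668_536')))
```

The number of nodes returned is always one more than the reported distance, which is a cheap consistency check on the output.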

