# BDCC HW: Task 2

Task: Implement a MapReduce job that creates a list of followees for each user in the dataset.

Example: the followees of user 534 are the values in the second column of lines 97097–97187 of the dataset.
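
To make the expected result concrete, here is a minimal plain-Python sketch of the same grouping, without mrjob. It assumes each input line holds a follower ID and a followee ID separated by whitespace. The function name `followees_by_user` and the sample edges are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def followees_by_user(lines):
    """Group followee IDs by follower ID (map, shuffle and reduce in one pass)."""
    grouped = defaultdict(list)
    for line in lines:
        # Assumed line format: "<follower> <followee>"
        follower, followee = line.split()
        grouped[int(follower)].append(int(followee))
    return dict(grouped)


# Hypothetical sample edges, not taken from data/graph.txt
sample = ["1 2", "1 3", "2 3"]
result = followees_by_user(sample)
print(result)
```

The MapReduce job produces the same mapping, but the grouping by key is done by the framework between the map and reduce phases instead of by a dictionary in memory.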

In [1]:
%%file src/task2.py
#!/usr/bin/env python3


from mrjob.job import MRJob


# Implement a MapReduce job that creates a list of followees for each user in the dataset.
class Followees(MRJob):

    # Arg 1: self: the job instance
    # Arg 2: input key to the map function (here: None, so it is ignored)
    # Arg 3: input value to the map function (here: one line from the input file)
    def mapper(self, _, line):

        # TODO trailing zeros?

        # Emit a (follower, followee) pair; assumes exactly two whitespace-separated fields
        follower, followee = line.split()
        yield int(follower), int(followee)


    # Arg 1: self: the job instance
    # Arg 2: input key to the reduce function (here: the key emitted by the mapper)
    # Arg 3: input values to the reduce function (here: a generator over ALL values
    # associated with the same key; their order is not guaranteed unless SORT_VALUES is set)
    def reducer(self, follower, followees):
        yield follower, list(followees)


if __name__ == '__main__':
    Followees.run()

Overwriting src/task2.py
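
The mapper above raises `ValueError` on any line that does not contain exactly two integer fields, such as a blank line or a comment header. Whether `data/graph.txt` contains such lines is not known here. If it does, a defensive parser along these lines could be used; `parse_edge` is a hypothetical helper, not part of mrjob.

```python
def parse_edge(line):
    """Return (follower, followee) as ints, or None if the line is malformed."""
    fields = line.split()
    if len(fields) != 2:
        return None
    try:
        return int(fields[0]), int(fields[1])
    except ValueError:
        # At least one field is not an integer literal
        return None


# Possible use inside the mapper (sketch):
#     edge = parse_edge(line)
#     if edge is not None:
#         yield edge
```

Silently skipping bad lines is a design choice; counting them instead would make data problems visible.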


### Run in Standalone Mode

In [2]:
!python3 src/task2.py data/graph.txt

50176, 156534, 157119, 162626, 162636, 193430, 216893, 218478, 289415, 314827, 321458, 412005, 415879, 434125, 434126, 434127, 434128, 434129, 434130, 434131, 434132, 434133, 434134, 434135, 434136, 434137, 434138, 434139, 434140, 702363, 789602, 789603, 789604]
123471	[390622]
123472	[143789, 318340, 332019, 378188, 433980, 789589]
123473	[433918]
123474	[138469, 177798, 337633]
123475	[434003]
123478	[434028, 434029, 434030]
123479	[433978, 433979, 1034018, 1052177]
123482	[303093, 800858]
123483	[188438]
123484	[124187, 181443, 186719, 198499, 221384, 433936, 433937, 433938, 433939]
123485	[125879, 126200, 161104, 162463, 203939, 211539, 218326, 245676, 249662, 292070, 356317, 404741, 433987, 433988, 433989]
123486	[135093, 136599, 136998, 138145, 139274, 139537, 142332, 142760, 144083, 144919, 147193, 160220, 263532, 368311, 433981, 433982, 433983, 433984, 433985, 447704, 692139, 789637, 789638, 789639, 789640, 789641]
123487	[434035, 434036]
123488	[162626, 255292, 346155, 434354,

### Run in the Hadoop cluster in a fully/pseudo distributed mode

In [18]:
!python3 src/task2.py -r hadoop data/graph.txt -o task2_output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Falling back to 'hadoop'
Traceback (most recent call last):
  File "src/task1.py", line 31, in <module>
    Followers.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 616, in run
    cls().execute()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 687, in execute
    self.run_job()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 636, in run_job
    runner.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/runner.py", line 503, in run
    self._run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/hadoop.py", line 325, in _run
    self._find_binaries_and_jars()
  File "/Library/Frameworks/Python

### Copy the output from HDFS to local file system.

In [None]:
!hdfs dfs -copyToLocal task2_output /home/bdccuser/bdcc-assignment1/output/task2
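
Once the output is on the local file system, it can be loaded back into Python. The sketch below assumes the copied directory contains Hadoop `part-*` files and that the job used mrjob's default output format, a JSON-encoded key and value separated by a tab, as in the standalone output above. `load_followees` is a hypothetical helper, and the demo writes a temporary file instead of reading the real output directory.

```python
import glob
import json
import os
import tempfile


def load_followees(output_dir):
    """Read every part-* file in output_dir into a {user: [followees]} dict."""
    followees = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                # Each line is "<json key>\t<json value>"
                key, value = line.split("\t", 1)
                followees[json.loads(key)] = json.loads(value)
    return followees


# Demo on a temporary directory holding two lines in the same format
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "part-00000"), "w") as handle:
        handle.write("123471\t[390622]\n123473\t[433918]\n")
    demo = load_followees(tmp)
print(demo)
```

For the real run, the argument would be the local directory that `-copyToLocal` created; its exact layout depends on whether the destination path already existed.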