# BDCC HW: Task 1

Task: Implement a MapReduce job that creates a list of followers for each user in the dataset.

Example: the list of followers of user 534 is: [2, 16, 37, 73, 156, 210, 308, 347, 446, 455, 487, 519].
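
The grouping this task asks for can be sketched in plain Python, without Hadoop or mrjob. This is only an illustration, not the submitted job. It assumes each input line holds a whitespace-separated `follower followee` pair (the format the mapper in `src/task1.py` splits on). The names `map_edge`, `followers_by_user`, and the sample edges are hypothetical.

```python
from collections import defaultdict


def map_edge(line):
    """Map step: turn one 'follower followee' line into a (followee, follower) pair."""
    follower, followee = line.split()
    return followee, follower


def followers_by_user(lines):
    """Shuffle and reduce steps: collect all followers under their followee."""
    grouped = defaultdict(list)
    for line in lines:
        followee, follower = map_edge(line)
        grouped[followee].append(follower)
    return dict(grouped)


# Hypothetical sample edges, not taken from data/graph.txt
sample_edges = ["2 534", "16 534", "2 7"]
print(followers_by_user(sample_edges))
```

As in the MapReduce job, user IDs stay strings here, so any ordering of keys is lexicographic rather than numeric.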

In [19]:
%%file src/task1.py
#!/usr/bin/env python3

from mrjob.job import MRJob


# Implement a MapReduce job that creates a list of followers for each user 
# in the dataset.
class Followers(MRJob):

    # Arg 1: self: the job instance
    # Arg 2: input key to the map function (unused; None for raw text input)
    # Arg 3: input value to the map function (one line from the input file)
    def mapper(self, _, line):
        # Each line holds a "follower followee" pair; emit it keyed by followee
        follower, followee = line.split()
        yield followee, follower


    # Arg 1: self: the job instance
    # Arg 2: input key to the reduce function (here: the key emitted by the mapper)
    # Arg 3: input values to the reduce function (here: a generator over ALL values
    # associated with the same key; their order is not guaranteed)
    def reducer(self, followee, followers):
        yield followee, list(followers)


if __name__ == '__main__':
    Followers.run()


Overwriting src/task1.py


### Run in Standalone Mode

In [16]:
!python3 src/task1.py data/graph.txt

["2662"]
"1132330"	["337057"]
"1132331"	["337321"]
"1132332"	["59839", "338099", "815247", "1072906"]
"1132333"	["338659"]
"1132334"	["1132333"]
"1132335"	["1132333"]
"1132336"	["341383"]
"1132337"	["342816"]
"1132338"	["342926"]
"1132339"	["343061"]
"113234"	["2662", "2700", "2774", "2783"]
"1132340"	["343061"]
"1132341"	["343061"]
"1132342"	["343740"]
"1132343"	["344034"]
"1132344"	["344134"]
"1132345"	["347331"]
"1132346"	["347795"]
"1132347"	["348514"]
"1132348"	["348631"]
"1132349"	["348752"]
"113235"	["2662", "2687", "86431", "104555"]
"1132350"	["350061"]
"1132351"	["1132350"]
"1132352"	["350926"]
"1132353"	["351436"]
"1132354"	["354180"]
"1132355"	["354442"]
"1132356"	["354487"]
"1132357"	["354471"]
"1132358"	["354471"]
"1132359"	["354471"]
"113236"	["2662", "27837"]
"1132360"	["354471"]
"1132361"	["354471"]
"1132362"	["768955", "1132358"]
"1132363"	["1132358"]
"1132364"	["355311"]
"1132365"	["357129"]
"1132366"	["357531"]
"1132367"	["357910"]
"1132368"	["1132367"]
"1132369"	["

### Run on the Hadoop cluster in fully or pseudo-distributed mode

In [18]:
!python3 src/task1.py -r hadoop data/graph.txt -o task1_output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Falling back to 'hadoop'
Traceback (most recent call last):
  File "src/task1.py", line 31, in <module>
    Followers.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 616, in run
    cls().execute()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 687, in execute
    self.run_job()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 636, in run_job
    runner.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/runner.py", line 503, in run
    self._run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/hadoop.py", line 325, in _run
    self._find_binaries_and_jars()
  File "/Library/Frameworks/Python

### Copy the output from HDFS to the local file system

In [None]:
!hdfs dfs -copyToLocal task1_output /home/bdccuser/bdcc-assignment1/output/task1
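
Once the output is on the local file system, it can be read back into Python. The sketch below assumes each output line has the form seen in the standalone run above: a JSON-encoded key, a tab, and a JSON-encoded list. The helper name `parse_output_line` is hypothetical, and the sample line is copied from the standalone output.

```python
import json


def parse_output_line(line):
    """Split one 'key<TAB>value' output line and decode both JSON parts."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)


# Sample line copied from the standalone output above
sample = '"113236"\t["2662", "27837"]'
user, followers = parse_output_line(sample)

# IDs are strings in the output; sort numerically if a numeric order is needed
followers_sorted = sorted(followers, key=int)
print(user, followers_sorted)
```

To process a whole result file, the same function can be applied to every line of each part file in the copied output directory.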