# BDCC HW: Task 1

Task: Implement a MapReduce job that creates a list of followers for each user in the dataset.

Example: the list of followers of user 534 is: [2, 16, 37, 73, 156, 210, 308, 347, 446, 455, 487, 519].
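
Before writing the actual job, the intended grouping can be sketched in plain Python without mrjob. This is only an illustration: the sample edges and the helper names `map_line` and `followers_by_user` are hypothetical, and the "follower followee" line format is an assumption taken from how the mapper below splits each line.

```python
from collections import defaultdict

# Hypothetical edge list; each line is assumed to be "follower followee".
SAMPLE_LINES = ["2 534", "16 534", "2 16", "37 534"]


def map_line(line):
    """Map step: emit one (followee, follower) pair per input line."""
    follower, followee = map(int, line.split())
    yield followee, follower


def followers_by_user(lines):
    """Shuffle and reduce steps: group pairs by followee, then sort each group."""
    grouped = defaultdict(list)
    for line in lines:
        for followee, follower in map_line(line):
            grouped[followee].append(follower)
    return {followee: sorted(followers) for followee, followers in grouped.items()}


result = followers_by_user(SAMPLE_LINES)
print(result)
```

In a real MapReduce run the grouping by key is done by the framework between the map and reduce phases; the dictionary here only stands in for that step.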

In [3]:
%%file src/task1.py
#!/usr/bin/env python3

from mrjob.job import MRJob


# Implement a MapReduce job that creates a list of followers for each user in the dataset.
class Followers(MRJob):

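    # Arg 1: self: the job instance (like `this` in Java)
    # Arg 2: Input key to the map function (unused: None for raw text input)
    # Arg 3: Input value to the map function (one line from the input file)
    def mapper(self, _, line):
        # Each line holds "follower followee"; emit (followee, follower) so that
        # all followers of a user are grouped under the same key.
        # Convert to int so that IDs are sorted numerically, not as strings.
        follower, followee = map(int, line.split())
        yield followee, follower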


    # Arg 1: self: the job instance
    # Arg 2: Input key to the reduce function (here: the key that was emitted by the mapper)
    # Arg 3: Input value to the reduce function (here: a generator over ALL values
    # emitted for that key; the values arrive in no guaranteed order)
    def reducer(self, followee, followers):
        # sorted() consumes the generator and returns the followers as a sorted list
        yield followee, sorted(followers)


if __name__ == '__main__':
    Followers.run()


Overwriting src/task1.py


### Run in Standalone Mode

In [4]:
!python3 src/task1.py data/graph.txt

1129049	[1129047]
112905	[2662]
1129050	[887174]
1129051	[96083, 887508]
1129052	[887668]
1129053	[231547, 889986]
1129054	[1129053]
1129055	[1129053]
1129056	[1129055]
1129057	[890098]
1129058	[890823]
1129059	[891328]
112906	[2662]
1129060	[892176]
1129061	[1129060]
1129062	[1129060]
1129063	[892748]
1129064	[1129063]
1129065	[892812]
1129066	[1129064]
1129067	[893311]
1129068	[893311]
1129069	[893887]
112907	[2662]
1129077	[894217]
1129078	[895086]
1129079	[896033]
112908	[2662, 2674, 2775, 60060, 104343, 104554, 104556, 104558]
1129080	[896202]
1129081	[897572]
1129082	[898148]
1129083	[900181]
1129084	[900396]
1129085	[900485]
1129086	[900485]
1129087	[900485]
1129088	[901614]
1129089	[901943]
112909	[2421, 2633, 2662, 2783, 27020, 27802, 104133]
1129090	[1129089]
1129091	[903954]
1129092	[904790]
1129093	[904790]
1129094	[904790]
1129095	[905640]
1129096	[906090]
1129097	[906293]
1129098	[906304]
1129099	[908051]
11291	[104, 832, 867, 884, 1070, 1177, 1180, 1191, 1227, 1432, 1959
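
The keys in the output above are not in numeric order: 112905 appears between 1129049 and 1129050. That pattern is what text ordering of the keys would produce. Whether the runner really sorts on the serialized key is an assumption here, not something the output confirms. A minimal sketch of the difference, using four keys from the run:

```python
# Keys copied from the standalone output above.
keys = [1129049, 112905, 1129050, 1129051]

# Ordering by text compares character by character, so "112905" sorts
# after "1129049" ('5' > '4' at the sixth character).
as_text = sorted(str(k) for k in keys)

# Ordering by value compares the integers themselves.
as_numbers = sorted(keys)

print(as_text)
print(as_numbers)
```

If numerically ordered output is needed, one option is to sort the result file in a separate post-processing step.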

### Run on the Hadoop cluster in fully or pseudo-distributed mode

In [5]:
!python3 src/task1.py -r hadoop data/graph.txt -o task1_output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Falling back to 'hadoop'
Traceback (most recent call last):
  File "src/task1.py", line 31, in <module>
    Followers.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 616, in run
    cls().execute()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 687, in execute
    self.run_job()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 636, in run_job
    runner.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/runner.py", line 503, in run
    self._run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/hadoop.py", line 325, in _run
    self._find_binaries_and_jars()
  File "/Library/Frameworks/Python

### Copy the output from HDFS to the local file system

In [None]:
!hdfs dfs -copyToLocal task1_output /home/bdccuser/bdcc-assignment1/output/task1
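
Once the output is available locally, it can be read back into Python. The sketch below assumes the lines keep the format seen in the standalone run (a JSON key, a tab, then a JSON list); the helper name `parse_output_line` is hypothetical, and the sample lines are copied from the output above rather than read from the copied files.

```python
import json


def parse_output_line(line):
    """Split one 'key<TAB>JSON value' line into (user, followers)."""
    key, value = line.rstrip("\n").split("\t", 1)
    return json.loads(key), json.loads(value)


# Two lines copied from the standalone run above.
sample = ["1129051\t[96083, 887508]\n", "112905\t[2662]\n"]

followers = dict(parse_output_line(line) for line in sample)
print(followers[1129051])
```

To process the real result, the same function could be applied to every line of each file inside the copied output directory.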