# BDCC HW: Task 3

Task: Implement a MapReduce job that identifies the 100 most followed users in the dataset.

Hint: We do not need to build follower lists here. It is enough to count the followers of each user in the Reduce phase; this count is called the in-degree of a user. In addition, a temporary data structure D with a fixed capacity of 100 entries is required. D is initially filled with the first 100 users processed in the Reduce phase. Each subsequent user (101, 102, 103, …) replaces the least-followed user in D only if its in-degree is greater than that user's in-degree. The ideal data structure for D is a min-heap (https://docs.python.org/3/library/heapq.html), but a dictionary or a list is also acceptable if used appropriately.
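A minimal sketch of the structure D described in the hint, assuming the in-degrees have already been computed and arrive as `(user, in_degree)` pairs. The function name `top_n_by_in_degree` and the sample data are hypothetical and chosen only for illustration.

```python
import heapq


def top_n_by_in_degree(pairs, n=100):
    """Keep the n pairs with the largest in-degree using a bounded min-heap."""
    heap = []  # entries are (in_degree, user); heap[0] is the least-followed
    for user, in_degree in pairs:
        entry = (in_degree, user)
        if len(heap) < n:
            # D is not full yet: the first n users always enter.
            heapq.heappush(heap, entry)
        elif in_degree > heap[0][0]:
            # Strictly more followers than the current minimum: swap it out.
            heapq.heapreplace(heap, entry)
    # Most followed first.
    return sorted(heap, reverse=True)


# Hypothetical (user, in_degree) pairs, with n kept small for readability.
sample = [(1, 5), (2, 1), (3, 7), (4, 3), (5, 9)]
print(top_n_by_in_degree(sample, n=3))  # [(9, 5), (7, 3), (5, 1)]
```

Because the smallest in-degree always sits at `heap[0]`, each new user costs one comparison plus at most one O(log n) replacement, and memory stays bounded at n entries regardless of how many users are processed.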

In [4]:
%%file src/task3.py
#!/usr/bin/env python3


from mrjob.job import MRJob
from heapq import heappush


# Implement a MapReduce job that identifies the 100 most followed users in the dataset.
class MostFollowed(MRJob):

    # Arg 1: self: the job instance (like `this` in other languages)
    # Arg 2: Input key to the map function (here: none)
    # Arg 3: Input value to the map function (here: one line from the input file)
    def mapper(self, _, line):

        # TODO sort keys as int

        # yield (followee, 1) pair
        (follower, followee) = line.split()
        yield(int(followee), 1)

    def combiner(self, followee, follower_count):

        # yield sum of followers
        yield(followee, sum(follower_count))


    # Arg 1: self: the job instance (like `this` in other languages)
    # Arg 2: Input key to the reduce function (here: the key that was emitted by the mapper)
    # Arg 3: Input value to the reduce function (here: a generator over ALL values
    # associated with the same key)
    def reducer(self, followee, follower_count):

        # TODO get only top 100 using min_heap
        top_followed = []

        yield(followee, sum(follower_count))


if __name__ == '__main__':
    MostFollowed.run()

Overwriting src/task3.py
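The reducer above is called once per followee, so a list created inside it starts empty for every key and never sees more than one user. Picking the top 100 needs a step that sees all `(in_degree, followee)` pairs. In mrjob, one option might be to keep that state across calls and emit it from a `reducer_final` hook; the sketch below does not use mrjob at all and only simulates the phases locally. `simulate_job` and the sample edges are hypothetical, and it uses `heapq.nlargest` for the selection rather than a hand-maintained heap.

```python
import heapq
from collections import defaultdict


def mapper(line):
    """Emit (followee, 1) for one 'follower followee' edge."""
    _follower, followee = line.split()
    yield int(followee), 1


def simulate_job(lines, n=100):
    """Hypothetical local stand-in for the job; it does not use mrjob."""
    # Shuffle: collect every value emitted for the same key.
    grouped = defaultdict(list)
    for line in lines:
        for followee, one in mapper(line):
            grouped[followee].append(one)

    # Reduce: one in-degree per followee, as in the reducer above.
    in_degrees = ((sum(ones), followee) for followee, ones in grouped.items())

    # Global selection: this step has to see every (in_degree, followee) pair.
    return heapq.nlargest(n, in_degrees)


# Hypothetical edge list in the same 'follower followee' format.
edges = ["1 2", "3 2", "4 2", "1 3", "2 3", "1 4"]
print(simulate_job(edges, n=2))  # [(3, 2), (2, 3)]
```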


### Run in Standalone Mode

In [5]:
!python3 src/task3.py data/graph.txt

0341	1
1130342	1
1130343	1
1130344	1
1130345	1
1130346	1
1130347	1
1130348	1
1130349	1
113035	7
1130350	2
1130351	1
1130352	1
1130353	1
1130354	1
1130355	1
1130356	1
1130357	1
1130358	1
1130359	1
113036	1
1130360	3
1130361	1
1130362	1
1130363	1
1130364	1
1130365	1
1130366	1
1130367	1
1130368	1
1130369	2
113037	1
1130370	1
1130371	1
1130372	2
1130373	1
1130374	1
1130375	2
1130376	1
1130377	1
1130378	1
1130379	1
113038	2
1130380	1
1130381	1
1130382	1
1130383	1
1130384	4
1130385	1
1130386	1
1130387	1
1130388	1
1130389	1
113039	1
1130390	1
1130391	1
1130392	1
1130393	1
1130394	1
1130395	2
1130396	1
1130397	1
1130398	1
1130399	1
11304	69
113040	1
1130400	1
1130401	1
1130402	1
1130403	1
1130404	1
1130405	1
1130406	1
1130407	1
1130408	1
1130409	3
113041	1
1130410	1
1130411	1
1130412	1
1130413	1
1130414	1
1130415	1
1130416	2
1130417	1
1130418	2
1130419	1
113042	1
1130420	1
1130421	1
1130422	1
1130423	1
1130424	1
1130425	1
1130426	1
1130427	1
1130428	2
1130429	1
113043	4
1130430	1
1130431	1
113
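The listing above is ordered as text rather than by numeric value: `113035` appears between `1130349` and `1130350`. This is presumably what the `# TODO sort keys as int` note in the mapper refers to. The sketch below reproduces that ordering with three keys taken from the output. The zero-padding step is an assumption about one possible workaround, not something the job currently does, and the width of 10 is an assumed upper bound on ID length.

```python
keys = [113035, 1130349, 1130350]

# Compared as text, the order matches the output above.
as_text = sorted(str(k) for k in keys)

# Zero-padding to a fixed width makes text order agree with numeric order.
as_padded = sorted(f"{k:010d}" for k in keys)

print(as_text)    # ['1130349', '113035', '1130350']
print(as_padded)  # ['0000113035', '0001130349', '0001130350']
```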

### Run in the Hadoop cluster in a fully/pseudo distributed mode

In [18]:
!python3 src/task3.py -r hadoop data/graph.txt -o task3_output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Falling back to 'hadoop'
Traceback (most recent call last):
  File "src/task1.py", line 31, in <module>
    Followers.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 616, in run
    cls().execute()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 687, in execute
    self.run_job()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/job.py", line 636, in run_job
    runner.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/runner.py", line 503, in run
    self._run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mrjob/hadoop.py", line 325, in _run
    self._find_binaries_and_jars()
  File "/Library/Frameworks/Python

### Copy the output from HDFS to local file system.

In [None]:
!hdfs dfs -copyToLocal task3_output /home/bdccuser/bdcc-assignment1/output/task3