# Write your full name here
# Assignment 1: MapReduce - Blog Posts and Comments

## Input file description

You are given the input file `comments.csv`, a comma-separated values (CSV) file that stores comment relationships. The file contains 1 million rows, and each row stores a tuple of the form:

`PostAuthor,CommentAuthor,CommentDate`

The tuple contains three fields (columns):

* `PostAuthor`: the blog user who authored a blog post.
* `CommentAuthor`: another user who commented on a post by `PostAuthor`.
* `CommentDate`: the date on which the comment was made.
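
As a quick illustration of the row layout, the sketch below parses a single line with Python's `csv` module. The sample values (`alice`, `bob`, `2024-01-15`) are made up, and the `YYYY-MM-DD` date format is an assumption, since the file description does not state it.

```python
import csv

# Hypothetical sample row; real names and the date format may differ.
sample_line = "alice,bob,2024-01-15"

# csv.reader expects an iterable of lines, so wrap the single line in a list.
post_author, comment_author, comment_date = next(csv.reader([sample_line]))

print(post_author, comment_author, comment_date)
```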


## Tasks

You will write three MapReduce jobs **here** using MRJob:

* The first job will scan the input file and, for each `PostAuthor`, construct a tuple that contains:
  - The number of comments made on *all* of his/her posts, and
  - A list of the comments in the form `(CommentAuthor,CommentDate)`. The list must be sorted in decreasing `CommentDate` order, that is, the most recent comment comes first, followed by the older ones.
  - Example output: `PostAuthor NumberofComments [(Commentator,Date)(Commentator,Date)(Commentator,Date)...]`

* The second job will scan the input file and, for each `PostAuthor`, construct a tuple that contains:
  - The number of *distinct* commentators who commented on *any* of his/her posts, and
  - A list of the commentators in the form `[(DistinctCommentator1)(DistinctCommentator2)...]`
  - Example output: `PostAuthor NumberofDistinctCommentators [(DistinctCommentator1)(DistinctCommentator2)...]`
  - Use a combiner here.

* The third job will scan the input file and, for each `CommentAuthor`, construct a tuple that contains:
  - The number of comments that `CommentAuthor` has made on *all* posts, and
  - A list of the comments in the form `(PostAuthor,CommentDate)`. The list must be sorted in decreasing `CommentDate` order, that is, the most recent comment comes first, followed by the older ones.
  - Example output: `CommentAuthor NumberofComments [(PostAuthor,Date)(PostAuthor,Date)(PostAuthor,Date)...]`
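
The three aggregations described above can be sketched in plain Python (no MRJob) on a tiny in-memory sample, which may help when cross-checking the job outputs by hand. The sample rows, the helper names `comments_by` and `distinct_commentators`, and the `YYYY-MM-DD` date format are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (PostAuthor, CommentAuthor, CommentDate).
rows = [
    ("alice", "bob", "2024-01-15"),
    ("alice", "carol", "2024-03-02"),
    ("alice", "bob", "2024-02-10"),
    ("dave", "alice", "2024-01-20"),
]

def parse_date(text):
    """Parse a date string, assuming the YYYY-MM-DD format."""
    return datetime.strptime(text, "%Y-%m-%d")

def comments_by(rows, key_index, other_index):
    """Group rows by one author column; list (other author, date) newest first."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append((row[other_index], row[2]))
    return {
        key: (len(items), sorted(items, key=lambda item: parse_date(item[1]), reverse=True))
        for key, items in groups.items()
    }

def distinct_commentators(rows):
    """Map each post author to (count, sorted list) of distinct commentators."""
    groups = defaultdict(set)
    for post_author, comment_author, _ in rows:
        groups[post_author].add(comment_author)
    return {key: (len(names), sorted(names)) for key, names in groups.items()}

task1 = comments_by(rows, key_index=0, other_index=1)  # keyed by PostAuthor
task2 = distinct_commentators(rows)                    # keyed by PostAuthor
task3 = comments_by(rows, key_index=1, other_index=0)  # keyed by CommentAuthor

print(task1["alice"])
print(task2["alice"])
print(task3["bob"])
```

Tasks 1 and 3 differ only in which column serves as the key, which is why a single helper covers both in this sketch.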


## Deliverables

**There will be a single deliverable, this notebook**. You will organize your answers according to the provided structure, which is identical to that of the example notebooks uploaded to the e-learning platform. **Please write your full name in both the notebook's filename and the notebook's title (first line of first cell)**.

Then, upload the file to the e-learning platform.


## Answer to task 1

### 1.1 Python code for MapReduce

In [19]:
%%file task1.py
from mrjob.job import MRJob
import csv
from datetime import datetime

class PostAuthorComments(MRJob):
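    def mapper(self, _, line):
        # Each input line is one CSV record: PostAuthor,CommentAuthor,CommentDate
        reader = csv.reader([line])
        for row in reader:
            post_author, comment_author, comment_date = row
            # Key by post author so that all of their comments reach one reducer call
            yield post_author, (comment_author, comment_date)

    def reducer(self, key, values):
        # Materialize the comments, then sort them by date, most recent first
        comments = list(values)
        comments.sort(key=lambda x: datetime.strptime(x[1], "%Y-%m-%d"), reverse=True)
        yield key, (len(comments), comments)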

if __name__ == '__main__':
    PostAuthorComments.run()

Overwriting task1.py
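
The reducer parses every date with `datetime.strptime` before sorting. If the dates really are zero-padded `YYYY-MM-DD` strings (an assumption based on the format string in the code, not something the input description guarantees), plain string comparison gives the same order, as the hypothetical values below illustrate.

```python
from datetime import datetime

# Hypothetical (CommentAuthor, CommentDate) pairs in zero-padded ISO format.
comments = [("bob", "2024-01-15"), ("carol", "2023-12-31"), ("erin", "2024-02-01")]

# Sort as the reducer does: parse each date, newest first.
by_parsed = sorted(comments, key=lambda c: datetime.strptime(c[1], "%Y-%m-%d"), reverse=True)

# Sort on the raw strings; this only matches when every date is zero-padded.
by_string = sorted(comments, key=lambda c: c[1], reverse=True)

print(by_parsed == by_string)
```

Parsing remains the safer choice when the format is not guaranteed, because `strptime` raises `ValueError` on a malformed date instead of silently misordering it.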


### 1.2 Standalone execution

In [20]:
!python task1.py comments.csv > job_1_output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/task1.hadoop.20240421.185916.532951
Running step 1 of 1...
job output is in /tmp/task1.hadoop.20240421.185916.532951/output
Streaming final output from /tmp/task1.hadoop.20240421.185916.532951/output...
Removing temp directory /tmp/task1.hadoop.20240421.185916.532951...
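
To inspect the result, note that MRJob's default output protocol writes one line per key, with the JSON-encoded key and the JSON-encoded value separated by a tab. The sketch below reads one such line back; the helper name `parse_output_line` and the sample line are hypothetical, and the code assumes the job did not override the default protocol.

```python
import json

def parse_output_line(line):
    """Split one MRJob output line into (key, value), assuming the default JSON protocol."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)

# Hypothetical line in the shape task1.py would emit.
sample = '"alice"\t[2, [["carol", "2024-03-02"], ["bob", "2024-01-15"]]]\n'

author, (count, comments) = parse_output_line(sample)
print(author, count, comments[0])
```

Because the values pass through JSON, the Python tuples yielded by the reducer come back as lists.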


### 1.3 Running in the Hadoop cluster in a fully/pseudo distributed mode

In [24]:
!hadoop fs -ls /user/hadoop/

Found 12 items
drwxr-xr-x   - hadoop supergroup          0 2024-02-28 23:34 /user/hadoop/.sparkStaging
-rw-r--r--   1 hadoop supergroup   25981917 2024-04-21 22:36 /user/hadoop/comments.csv
drwxr-xr-x   - dr.who supergroup          0 2024-04-09 21:20 /user/hadoop/ihu
drwxr-xr-x   - hadoop supergroup          0 2024-04-21 13:45 /user/hadoop/out_iiDL22
drwxr-xr-x   - hadoop supergroup          0 2024-02-28 23:31 /user/hadoop/out_iiPI
drwxr-xr-x   - hadoop supergroup          0 2024-02-28 23:30 /user/hadoop/out_iiTL
-rw-r--r--   1 hadoop supergroup        889 2024-04-21 21:28 /user/hadoop/sample_data.csv
drwxr-xr-x   - hadoop supergroup          0 2024-02-28 23:35 /user/hadoop/selected_df.csv
-rw-r--r--   1 dr.who supergroup      99993 2024-02-23 01:26 /user/hadoop/shakespear_input.txt
drwxr-xr-x   - hadoop supergroup          0 2024-02-22 12:45 /user/hadoop/tmp
-rw-r--r--   1 hadoop supergroup      61947 2024-02-28 23:35 /user/hadoop/universities_ranking.csv
-rw-r--r--   1 hadoop supergr

In [23]:
!hadoop fs -put comments.csv /user/hadoop/comments.csv

In [25]:
!python task1.py -r hadoop hdfs:///user/hadoop/comments.csv --output-dir hdfs:///user/hadoop/job_1_output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /home/hadoop/hadoop/bin...
Found hadoop binary: /home/hadoop/hadoop/bin/hadoop
Using Hadoop version 3.3.6
Looking for Hadoop streaming jar in /home/hadoop/hadoop...
Found Hadoop streaming jar: /home/hadoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
Creating temp directory /tmp/task1.hadoop.20240421.193716.124821
uploading working dir files to hdfs:///user/hadoop/tmp/mrjob/task1.hadoop.20240421.193716.124821/files/wd...
Copying other local files to hdfs:///user/hadoop/tmp/mrjob/task1.hadoop.20240421.193716.124821/files/
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar7858821130576483077/] [] /tmp/streamjob8013802196787040630.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1713715981079_0003
  Total 

### 1.4 Copy the output file from HDFS to the local file system

In [26]:
!hadoop fs -ls /user/hadoop/job_1_output

Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2024-04-21 22:39 /user/hadoop/job_1_output/_SUCCESS
-rw-r--r--   1 hadoop supergroup   26007362 2024-04-21 22:39 /user/hadoop/job_1_output/part-00000


In [28]:
!hadoop fs -copyToLocal /user/hadoop/job_1_output/part-00000 job_1_output.txt

## Answer to task 2

### 2.1 Python code for MapReduce

In [29]:
%%file task2.py
#!/usr/bin/env python3

from mrjob.job import MRJob
from mrjob.step import MRStep
import csv

class DistinctCommentators(MRJob):

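    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_comments,
                   combiner=self.combiner_dedupe_pairs,
                   reducer=self.reducer_dedupe_pairs),
            MRStep(reducer=self.reducer_list_commentators)
        ]

    def mapper_get_comments(self, _, line):
        # Parse the line using csv; the comment date is not needed for this task
        reader = csv.reader([line])
        for row in reader:
            post_author, comment_author, _ = row
            yield (post_author, comment_author), None

    def combiner_dedupe_pairs(self, author_pair, _):
        # Collapse repeated (PostAuthor, CommentAuthor) pairs before the shuffle
        yield author_pair, None

    def reducer_dedupe_pairs(self, author_pair, _):
        # Each distinct pair reaches this reducer once; re-key it by PostAuthor
        post_author, comment_author = author_pair
        yield post_author, comment_author

    def reducer_list_commentators(self, post_author, commentators):
        # The pairs are already distinct; sorting makes the output order deterministic
        unique_commentators = sorted(set(commentators))
        yield post_author, (len(unique_commentators), unique_commentators)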

if __name__ == '__main__':
    DistinctCommentators.run()


Writing task2.py
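
The task statement asks for a combiner in this job. The plain-Python sketch below (no MRJob, hypothetical sample pairs and helper names) illustrates why deduplicating `(PostAuthor, CommentAuthor)` pairs on the mapper side is safe: the final result is unchanged, while fewer records cross the shuffle.

```python
from collections import defaultdict

# Hypothetical (PostAuthor, CommentAuthor) pairs as emitted by two separate mappers.
mapper_outputs = [
    [("alice", "bob"), ("alice", "bob"), ("alice", "carol")],
    [("alice", "bob"), ("dave", "alice"), ("dave", "alice")],
]

def combine(pairs):
    """Combiner: drop duplicate pairs within a single mapper's output."""
    return sorted(set(pairs))

def reduce_distinct(pairs):
    """Reducer: gather the distinct commentators of each post author."""
    groups = defaultdict(set)
    for post_author, comment_author in pairs:
        groups[post_author].add(comment_author)
    return {author: (len(names), sorted(names)) for author, names in groups.items()}

# Without a combiner, every mapper record is shuffled to the reducers.
shuffled_plain = [pair for output in mapper_outputs for pair in output]

# With a combiner, each mapper's output is deduplicated before the shuffle.
shuffled_combined = [pair for output in mapper_outputs for pair in combine(output)]

print(len(shuffled_plain), len(shuffled_combined))
print(reduce_distinct(shuffled_plain) == reduce_distinct(shuffled_combined))
```

A combiner may run zero or more times on any subset of a mapper's output, so it must never change the final answer. Deduplication satisfies this, since the reducer still sees every distinct pair at least once.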


### 2.2 Standalone execution

In [30]:
!python task2.py comments.csv > job_2_output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/task2.hadoop.20240421.200317.024797
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/task2.hadoop.20240421.200317.024797/output
Streaming final output from /tmp/task2.hadoop.20240421.200317.024797/output...
Removing temp directory /tmp/task2.hadoop.20240421.200317.024797...


### 2.3 Running in the Hadoop cluster in a fully/pseudo distributed mode

In [None]:
!hadoop fs -put comments.csv /user/hadoop/comments.csv

In [31]:
!python task2.py -r hadoop hdfs:///user/hadoop/comments.csv --output-dir hdfs:///user/hadoop/job_2_output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /home/hadoop/hadoop/bin...
Found hadoop binary: /home/hadoop/hadoop/bin/hadoop
Using Hadoop version 3.3.6
Looking for Hadoop streaming jar in /home/hadoop/hadoop...
Found Hadoop streaming jar: /home/hadoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
Creating temp directory /tmp/task2.hadoop.20240421.200450.732527
uploading working dir files to hdfs:///user/hadoop/tmp/mrjob/task2.hadoop.20240421.200450.732527/files/wd...
Copying other local files to hdfs:///user/hadoop/tmp/mrjob/task2.hadoop.20240421.200450.732527/files/
Running step 1 of 2...
  packageJobJar: [/tmp/hadoop-unjar6937329494835278535/] [] /tmp/streamjob4904205534611988525.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1713715981079_0004
  Total 

### 2.4 Copy the output file from HDFS to the local file system

In [32]:
!hadoop fs -ls /user/hadoop/job_2_output

Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2024-04-21 23:07 /user/hadoop/job_2_output/_SUCCESS
-rw-r--r--   1 hadoop supergroup     149446 2024-04-21 23:07 /user/hadoop/job_2_output/part-00000


In [35]:
!hadoop fs -copyToLocal /user/hadoop/job_2_output/part-00000 job_2_output.txt

## Answer to task 3

### 3.1 Python code for MapReduce

In [36]:
%%file task3.py
#!/usr/bin/env python3

from mrjob.job import MRJob
import csv
from datetime import datetime

class CommentAuthorActivity(MRJob):
    def mapper(self, _, line):
        # Parse the line using csv
        reader = csv.reader([line])
        for row in reader:
            post_author, comment_author, comment_date = row
            yield comment_author, (post_author, comment_date)

    def reducer(self, key, values):
        # Sort comments by date in descending order (most recent first)
        sorted_comments = sorted(values, key=lambda x: datetime.strptime(x[1], "%Y-%m-%d"), reverse=True)
        yield key, (len(sorted_comments), sorted_comments)

if __name__ == '__main__':
    CommentAuthorActivity.run()


Writing task3.py


### 3.2 Standalone execution

In [37]:
!python task3.py comments.csv > job_3_output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/task3.hadoop.20240421.203405.968257
Running step 1 of 1...
job output is in /tmp/task3.hadoop.20240421.203405.968257/output
Streaming final output from /tmp/task3.hadoop.20240421.203405.968257/output...
Removing temp directory /tmp/task3.hadoop.20240421.203405.968257...


### 3.3 Running in the Hadoop cluster in a fully/pseudo distributed mode

In [38]:
!python task3.py -r hadoop hdfs:///user/hadoop/comments.csv --output-dir hdfs:///user/hadoop/job_3_output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /home/hadoop/hadoop/bin...
Found hadoop binary: /home/hadoop/hadoop/bin/hadoop
Using Hadoop version 3.3.6
Looking for Hadoop streaming jar in /home/hadoop/hadoop...
Found Hadoop streaming jar: /home/hadoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
Creating temp directory /tmp/task3.hadoop.20240421.203816.775457
uploading working dir files to hdfs:///user/hadoop/tmp/mrjob/task3.hadoop.20240421.203816.775457/files/wd...
Copying other local files to hdfs:///user/hadoop/tmp/mrjob/task3.hadoop.20240421.203816.775457/files/
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar7036806997918749844/] [] /tmp/streamjob7397435104422057290.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1713715981079_0006
  Total 

### 3.4 Copy the output file from HDFS to the local file system

In [39]:
!hadoop fs -ls /user/hadoop/job_3_output

Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2024-04-21 23:41 /user/hadoop/job_3_output/_SUCCESS
-rw-r--r--   1 hadoop supergroup   26066459 2024-04-21 23:41 /user/hadoop/job_3_output/part-00000


In [None]:
!hadoop fs -cat /user/hadoop/job_3_output/part-00000

In [41]:
!hadoop fs -copyToLocal /user/hadoop/job_3_output/part-00000 job_3_output.txt
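
As a possible sanity check on the copied files: every input row contributes exactly one comment to one `PostAuthor` (task 1) and one comment to one `CommentAuthor` (task 3), so the comment counts in both outputs should add up to the same total. The sketch below shows the idea on hypothetical sample lines; the helper name `total_comments` is made up, and MRJob's default tab-separated JSON output is assumed.

```python
import json

def total_comments(output_lines):
    """Sum per-key comment counts from MRJob output lines (default JSON protocol assumed)."""
    total = 0
    for line in output_lines:
        _key, raw_value = line.rstrip("\n").split("\t", 1)
        count, _comments = json.loads(raw_value)
        total += count
    return total

# Hypothetical outputs for the same three comments, keyed two different ways.
job_1_sample = [
    '"alice"\t[2, [["carol", "2024-03-02"], ["bob", "2024-01-15"]]]\n',
    '"dave"\t[1, [["alice", "2024-01-20"]]]\n',
]
job_3_sample = [
    '"alice"\t[1, [["dave", "2024-01-20"]]]\n',
    '"bob"\t[1, [["alice", "2024-01-15"]]]\n',
    '"carol"\t[1, [["alice", "2024-03-02"]]]\n',
]

print(total_comments(job_1_sample) == total_comments(job_3_sample))
```

With the real files, an open file object such as `open("job_1_output.txt")` can be passed in place of the sample lists; both totals would then be expected to equal the number of rows in `comments.csv`.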