# Write your full name here
# Assignment 1: MapReduce - Blog Posts and Comments

## Input file description

You are given the input file `comments.csv`, a Comma Separated Values (CSV) file that stores comment relationships. The file contains 1 million rows, and each row stores a tuple of the form:

`PostAuthor,CommentAuthor,CommentDate`

The tuple contains three fields (columns):

* `PostAuthor`: this is a blog user who authored a blog post.
* `CommentAuthor`: this is another user who has commented on a post of the `PostAuthor`.
* `CommentDate`: this field stores the date that the comment was made.
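
As a minimal sketch of how one row maps to these three fields, the snippet below parses a single line with Python's `csv` module. The sample row, the helper name `parse_row`, and the `%Y-%m-%d` date format are assumptions for illustration; the actual layout of `CommentDate` should be checked against the file.

```python
import csv
from datetime import datetime

def parse_row(line, date_format="%Y-%m-%d"):
    """Split one CSV line into (post_author, comment_author, comment_date)."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list
    post_author, comment_author, raw_date = next(csv.reader([line]))
    # Assumed date format; adjust date_format if the file uses another layout
    comment_date = datetime.strptime(raw_date, date_format).date()
    return post_author, comment_author, comment_date

# Hypothetical sample row, not taken from comments.csv
print(parse_row("alice,bob,2020-05-17"))
```

Converting the date to a `date` object here makes the later ordering by `CommentDate` independent of how the string happens to be written.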


## Tasks

You will write three MapReduce jobs **here** using MRJob:

* The first job will scan the input file and for each `PostAuthor` it will construct a tuple that contains:
 - The number of comments made to *all* of his/her posts, and
 - A list of the comments in the form `(CommentAuthor,CommentDate)`. The list must be sorted in decreasing `CommentDate` order. Namely, the most recent comment must be placed at the top, followed by the older ones.
 - Example output: `PostAuthor NumberofComments [(Commentator,Date)(Commentator,Date)(Commentator,Date)...]`

* The second job will scan the input file and for each `PostAuthor` it will construct a tuple that contains:
 - The number of *distinct* commentators who commented on any of his/her posts (counted across *all* posts), and
 - A list of the commentators in the form `[(DistinctCommentator1)(DistinctCommentator2)...]`
 - Example output: `PostAuthor NumberofDistinctCommentators [(DistinctCommentator1)(DistinctCommentator2)...]`
 - Use a combiner here.

* The third job will scan the input file and for each `CommentAuthor` it will construct a tuple that contains:
 - The number of comments that `CommentAuthor` has made to *all* posts, and
 - A list of the comments in the form `(PostAuthor,CommentDate)`. The list must be sorted in decreasing `CommentDate` order. Namely, the most recent comment must be placed at the top, followed by the older ones.
 - Example output: `CommentAuthor NumberofComments [(PostAuthor,Date)(PostAuthor,Date)(PostAuthor,Date)...]`
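
The three results described above can be sketched in plain Python on a handful of rows, which may help when checking the MRJob output on a small sample. This is not the MRJob solution itself; the sample rows and function names are hypothetical, and the dates are assumed to be ISO-formatted (`YYYY-MM-DD`) strings.

```python
from collections import defaultdict

# Hypothetical rows in (PostAuthor, CommentAuthor, CommentDate) order
rows = [
    ("alice", "bob", "2020-01-03"),
    ("alice", "carol", "2020-02-10"),
    ("alice", "bob", "2020-03-01"),
    ("dave", "alice", "2020-01-15"),
]

def comments_per_post_author(rows):
    """Task 1: comment count and (CommentAuthor, CommentDate) list, newest first."""
    grouped = defaultdict(list)
    for post_author, comment_author, date in rows:
        grouped[post_author].append((comment_author, date))
    # Assumption: ISO dates sort chronologically when compared as strings
    return {k: (len(v), sorted(v, key=lambda c: c[1], reverse=True))
            for k, v in grouped.items()}

def distinct_commentators(rows):
    """Task 2: number and sorted list of distinct commentators per PostAuthor."""
    grouped = defaultdict(set)
    for post_author, comment_author, _ in rows:
        grouped[post_author].add(comment_author)
    return {k: (len(v), sorted(v)) for k, v in grouped.items()}

def comments_per_comment_author(rows):
    """Task 3: same grouping as task 1, but keyed by CommentAuthor."""
    swapped = [(c, p, d) for p, c, d in rows]
    return comments_per_post_author(swapped)

print(comments_per_post_author(rows)["alice"])
print(distinct_commentators(rows)["alice"])
print(comments_per_comment_author(rows)["bob"])
```

Task 3 reuses the task 1 grouping with the first two fields swapped, which mirrors how the two MapReduce jobs differ only in the key the mapper emits.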


## Deliverables

**There will be a single deliverable, this notebook**. You will organize your answers according to the provided structure, which is identical to the example notebooks that were uploaded to the e-learning platform. **Please write your full name in both the notebook's filename and the notebook's title (first line of first cell)**.

Then, upload the file to the e-learning platform.


## Answer to task 1

### 1.1 Python code for MapReduce

In [1]:
%%file task1.py
#!/usr/bin/env python3

from mrjob.job import MRJob
import csv
from datetime import datetime

class PostAuthorComments(MRJob):

    def mapper(self, _, line):
        # Use csv library to correctly parse CSV lines, even if there are commas within fields
        reader = csv.reader([line])
        for row in reader:
            post_author, comment_author, comment_date = row
            yield post_author, (comment_author, comment_date)

    def reducer(self, key, values):
        # Sort the (comment_author, comment_date) pairs newest first, then count them
        sorted_values = sorted(values, key=lambda x: datetime.strptime(x[1], '%Y-%m-%d'), reverse=True)
        yield key, (len(sorted_values), sorted_values)

if __name__ == '__main__':
    PostAuthorComments.run()



Writing task1.py


### 1.2 Standalone execution

In [None]:
!python task1.py comments.csv > output1.csv
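
A quick way to sanity-check the result is to read a few output lines back. The sketch below assumes the job keeps mrjob's default output protocol, which writes each record as a JSON-encoded key, a tab, and a JSON-encoded value; the helper name `parse_output_line` and the sample line are hypothetical.

```python
import json

def parse_output_line(line):
    """Decode one 'key<TAB>value' line written with mrjob's default JSON output."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)

# Hypothetical line in the shape task 1 produces (tuples become JSON arrays)
sample = '"alice"\t[2, [["bob", "2020-03-01"], ["carol", "2020-02-10"]]]\n'
author, (count, comments) = parse_output_line(sample)

# The count must match the list length, and dates must be in decreasing order
dates = [date for _, date in comments]
print(author, count == len(comments), dates == sorted(dates, reverse=True))
```

The same check can be applied to each line of `output1.csv` by iterating over the open file.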

### 1.3 Running in the Hadoop cluster in a fully/pseudo distributed mode

In [None]:
!python task1.py -r hadoop hdfs:///path/to/comments.csv > output1.csv

### 1.4 Copy the output file from HDFS to the local file system

In [None]:
!hdfs dfs -get /path/to/hadoop/output/part-00000 output1.csv

## Answer to task 2

### 2.1 Python code for MapReduce

In [None]:
%%file task2.py
#!/usr/bin/env python3

from mrjob.job import MRJob
from mrjob.step import MRStep

class DistinctCommentAuthors(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_authors,
                   combiner=self.combiner_unique_authors,
                   reducer=self.reducer_count_unique_authors)
        ]

    def mapper_get_authors(self, _, line):
        try:
            post_author, comment_author, _ = line.split(',')
            yield post_author, comment_author
        except ValueError:
            pass

    def combiner_unique_authors(self, post_author, comment_authors):
        unique_authors = set(comment_authors)  # Remove duplicates in the combiner
        for author in unique_authors:
            yield post_author, author

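    def reducer_count_unique_authors(self, post_author, unique_authors):
        # The combiner only deduplicates partial mapper output, so deduplicate again here
        unique_set = set(unique_authors)
        # Emit both the count and the list of distinct commentators, as the task requires
        yield post_author, (len(unique_set), sorted(unique_set))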

if __name__ == '__main__':
    DistinctCommentAuthors.run()
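
Hadoop may run a combiner zero, one, or several times on partial mapper output, so the combiner has to emit the same kind of records it receives, and the reducer cannot rely on it having run. The plain-Python sketch below (not MRJob code; the function names and sample data are hypothetical) illustrates why local deduplication does not change the final result.

```python
def combine(authors):
    """Local deduplication, mirroring what a combiner does on one mapper's output."""
    return set(authors)

def reduce_distinct(partials):
    """Final deduplication across all partial outputs for one PostAuthor."""
    merged = set()
    for partial in partials:
        merged.update(partial)
    return len(merged), sorted(merged)

# Hypothetical commentators for one PostAuthor, split across two mapper outputs
split_a = ["bob", "carol", "bob"]
split_b = ["bob", "erin"]

with_combiner = reduce_distinct([combine(split_a), combine(split_b)])
without_combiner = reduce_distinct([split_a, split_b])
print(with_combiner, with_combiner == without_combiner)
```

Because set union gives the same answer regardless of how the input is partitioned, the combiner only reduces the volume of data shuffled to the reducer.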


### 2.2 Standalone execution

In [None]:
!python task2.py comments.csv > output2.csv

### 2.3 Running in the Hadoop cluster in a fully/pseudo distributed mode

In [None]:
!python task2.py -r hadoop hdfs:///path/to/comments.csv > output2.csv

### 2.4 Copy the output file from HDFS to the local file system

In [None]:
!hdfs dfs -get /path/to/hadoop/output/part-00000 output2.csv

## Answer to task 3

### 3.1 Python code for MapReduce

In [None]:
%%file task3.py
#!/usr/bin/env python3

from mrjob.job import MRJob
from mrjob.step import MRStep
import csv
from datetime import datetime

class CommentAuthorDetails(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_posts,
                   reducer=self.reducer_list_posts)
        ]

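    def mapper_get_posts(self, _, line):
        # Parse with the csv module so quoted fields that contain commas are handled
        try:
            post_author, comment_author, comment_date = next(csv.reader([line]))
            yield comment_author, (post_author, comment_date)
        except ValueError:
            # Skip rows that do not have exactly three fields
            pass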

    def reducer_list_posts(self, key, values):
        posts_list = list(values)
        posts_list.sort(key=lambda post: datetime.strptime(post[1], '%Y-%m-%d'), reverse=True)
        yield key, (len(posts_list), posts_list)

if __name__ == '__main__':
    CommentAuthorDetails.run()


### 3.2 Standalone execution

In [None]:
!python task3.py comments.csv > output3.csv

### 3.3 Running in the Hadoop cluster in a fully/pseudo distributed mode

In [None]:
!python task3.py -r hadoop hdfs:///path/to/comments.csv > output3.csv

### 3.4 Copy the output file from HDFS to the local file system

In [None]:
!hdfs dfs -get /path/to/hadoop/output/part-00000 output3.csv