# MapReduce Exercise #2
##### Task 1 - Write a MapReduce (MR) job that ranks movies (by movie ID) by their popularity.
Use the `ratings.csv` file as input.

##### Task 2 - Create a table over the output and show the top 20 movies (movieID, average rating).

##### Task 3 - Show only movies with more than 10 ratings.

#### Generators
* Tip: To both count and sum the ratings, you'll need to convert the `values` generator in the reducer to a list, because a generator can only be iterated once.
* That is fine for our small example, but why could it be dangerous with a larger dataset?
* How can you solve the problem above without converting the generator into a list?
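
One possible answer to the last question, as a minimal sketch: keep a running total and a running count while iterating over the generator once, so memory use stays constant regardless of how many ratings a movie has. The helper name `mean_and_count` is hypothetical (it is not part of mrjob), and the sample ratings are made up for illustration.

```python
def mean_and_count(values):
    """Return (mean, count) for an iterable, consuming it exactly once.

    Hypothetical helper: only two numbers are kept in memory,
    no matter how many values the iterable produces.
    """
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    # Avoid dividing by zero when the iterable is empty
    mean = total / count if count else None
    return mean, count


# Illustrative data only: a generator standing in for the reducer's values
sample_ratings = (r for r in [5, 4, 3, 4])
mean, count = mean_and_count(sample_ratings)
print(mean, count)
```

Inside a reducer, the same loop could run directly over the `ratings` argument, yielding the mean only when the count exceeds 10.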

In [2]:
# Write your code here
from mrjob.job import MRJob
from mrjob.step import MRStep
import math

class MoviesByPopularity(MRJob):
    def steps(self): 
        return [MRStep(mapper=self.mapper_get_ratings, 
                       reducer=self.reducer_count_ratings)]

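    def mapper_get_ratings(self, _, line):
        # Each input line has the form: userId,movieId,rating,timestamp
        (userID, movieID, rating, timestamp) = line.split(',')
        # Skip the CSV header row
        if rating != 'rating':
            # Note: math.ceil rounds half-star ratings up (e.g. 3.5 -> 4),
            # so the resulting averages are higher than the raw mean
            yield movieID, math.ceil(float(rating))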
        
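    def reducer_count_ratings(self, movie, ratings):
        # Materialize the generator so it can be both counted and summed
        ratings = list(ratings)
        # Task 3: keep only movies with more than 10 ratings
        if len(ratings) > 10:
            yield movie, sum(ratings) / len(ratings)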


In [3]:
MoviesByPopularity(args=["/dbfs/FileStore/tables/ratings.csv", "-o", "/dbfs/FileStore/tables/movies_popularity"]).execute()

In [4]:
%sql
CREATE TABLE IF NOT EXISTS movies_popularity(movie STRING, rating FLOAT) USING CSV OPTIONS (path "/FileStore/tables/movies_popularity", header "false", delimiter='\t');
SELECT * FROM movies_popularity ORDER BY rating DESC LIMIT 20;

movie,rating
1178,4.6666665
2360,4.6666665
3451,4.6363635
1041,4.6363635
1104,4.55
28,4.5454545
7156,4.5384617
318,4.5299683
3275,4.5116277
3468,4.5
