<a href="https://colab.research.google.com/github/ananyabadkar/movie-ratings-Spark-practice-notebooks-/blob/main/notebooks/%20Movie_Ratings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Movie Ratings Analysis in Google Colab
This notebook replaces a Hadoop sandbox setup with **mrjob** and **PySpark** running directly in Colab.
Upload your dataset (`ratings.data`, a tab-separated file with `user`, `movie`, `rating`, and `timestamp` fields) and run the steps.

In [None]:
%%writefile movie_rating_counts.py
from mrjob.job import MRJob

class MovieRatingCounts(MRJob):
    def mapper(self, _, line):
        # Split the input line into parts
        user, movie, rating, timestamp = line.split('\t')
        # Emit movie with count 1
        yield movie, 1

    def reducer(self, movie, counts):
        # Sum up all the counts per movie
        yield movie, sum(counts)

if __name__ == '__main__':
    MovieRatingCounts.run()

Writing movie_rating_counts.py


In [None]:
!python movie_rating_counts.py ratings.data > movie_counts.txt
!head movie_counts.txt


No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/movie_rating_counts.root.20250903.184857.451509
Running step 1 of 1...
job output is in /tmp/movie_rating_counts.root.20250903.184857.451509/output
Streaming final output from /tmp/movie_rating_counts.root.20250903.184857.451509/output...
Removing temp directory /tmp/movie_rating_counts.root.20250903.184857.451509...
"90"	1
"40"	1
"50"	2
"10"	1
"20"	1
"30"	1
"60"	1
"70"	1
"80"	1
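
For intuition, the map and reduce steps of the counting job can be imitated in plain Python. This is only a minimal sketch, not how mrjob runs the job internally: `count_ratings` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from `ratings.data`.

```python
from collections import Counter

def count_ratings(lines):
    """Count ratings per movie from tab-separated user/movie/rating/timestamp lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 4:
            continue  # skip malformed lines
        movie = fields[1]
        # Equivalent to the mapper emitting (movie, 1) and the reducer summing
        counts[movie] += 1
    return counts

# Hypothetical sample rows, not taken from ratings.data
sample = ["1\t50\t5\t881250949", "2\t50\t4\t881250950", "3\t10\t2\t881250951"]
print(count_ratings(sample))
```

Keys stay strings here, which matches the quoted movie IDs in the mrjob output.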


In [None]:
!pip install mrjob pyspark

Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl.metadata (7.3 kB)
Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4


In [None]:
from google.colab import files
uploaded = files.upload()   # Upload ratings.data

Saving ratings.data to ratings.data


## 🔹 MapReduce with `mrjob`

In [None]:
%%writefile movie_ratings.py

from mrjob.job import MRJob

class MovieRatings(MRJob):
    def mapper(self, _, line):
        try:
            user, movie, rating, timestamp = line.split('\t')
            yield movie, int(rating)
        except ValueError:
            # Skip malformed lines (wrong field count or non-integer rating)
            pass

    def reducer(self, movie, ratings):
        ratings_list = list(ratings)
        yield movie, sum(ratings_list)/len(ratings_list)

if __name__ == '__main__':
    MovieRatings.run()


Writing movie_ratings.py
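
The reducer above copies every rating for a movie into a list before averaging. For movies with many ratings, a running sum and count gives the same result in one pass with constant memory. The sketch below shows that idea in isolation; `streaming_mean` is a hypothetical helper name, not part of mrjob.

```python
def streaming_mean(values):
    """Average an iterable in a single pass without storing its items."""
    total = 0
    count = 0
    for value in values:
        total += value
        count += 1
    if count == 0:
        raise ValueError("cannot average an empty sequence")
    return total / count

# Works on generators too, like the iterator a reducer receives
print(streaming_mean(iter([4, 5])))
```

Inside the reducer, this would presumably be used as `yield movie, streaming_mean(ratings)`. A reducer is only called for keys that have at least one value, so the empty case should not arise there.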


In [None]:
!python movie_ratings.py ratings.data > output.txt
!head output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/movie_ratings.root.20250903.180348.656781
Running step 1 of 1...
job output is in /tmp/movie_ratings.root.20250903.180348.656781/output
Streaming final output from /tmp/movie_ratings.root.20250903.180348.656781/output...
Removing temp directory /tmp/movie_ratings.root.20250903.180348.656781...
"90"	2.0
"40"	4.0
"50"	4.5
"10"	2.0
"20"	5.0
"30"	3.0
"60"	5.0
"70"	3.0
"80"	4.0
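
Each output line above is a JSON-encoded key and a JSON-encoded value separated by a tab, which is what mrjob's default output protocol produces. A small sketch for loading those lines back into Python and ranking movies by average rating follows; it assumes the job keeps that default protocol, and `parse_mrjob_line` and `top_movies` are hypothetical helper names.

```python
import json

def parse_mrjob_line(line):
    """Split one 'key<TAB>value' output line and JSON-decode both halves."""
    key, value = line.rstrip('\n').split('\t', 1)
    return json.loads(key), json.loads(value)

def top_movies(lines, n=3):
    """Return the n (movie, average) pairs with the highest average rating."""
    pairs = [parse_mrjob_line(line) for line in lines if line.strip()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]

# Three lines copied from the output above
sample = ['"90"\t2.0', '"40"\t4.0', '"50"\t4.5']
print(top_movies(sample, n=2))
```

In the notebook, the same function can read the saved file directly, for example `with open('output.txt') as f: print(top_movies(f))`.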


## 🔹 Analysis with PySpark (Modern Way)

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MovieRatings').getOrCreate()

# Read the tab-separated file and name its four columns
df = spark.read.csv('ratings.data', sep='\t', inferSchema=True) \
          .toDF('user', 'movie', 'rating', 'timestamp')

avg_ratings = df.groupBy('movie').avg('rating')
avg_ratings.show(10)

+-----+-----------+
|movie|avg(rating)|
+-----+-----------+
|   20|        5.0|
|   40|        4.0|
|   10|        2.0|
|   50|        4.5|
|   80|        4.0|
|   70|        3.0|
|   60|        5.0|
|   90|        2.0|
|   30|        3.0|
+-----+-----------+

