
<h2>MapReduce Mini-Project: Analyzing Amazon Movie Reviews</h2>

<p>
In this exercise, you will work as a data engineer for a streaming platform.
Your goal is to perform several analytics tasks on a free and publicly
available dataset of Amazon Movie Reviews using MapReduce in Hadoop.
</p>

<p>
You will complete four tasks:
</p>

<ol>
  <li><b>Count total number of reviews per movie</b></li>
  <li><b>Compute average rating per movie</b></li>
  <li><b>Extract frequent keywords from reviews</b></li>
  <li><b>Join average ratings with top keywords</b></li>
</ol>

<p>
For each task, you will write a MapReduce program (Python Streaming or Java)
and run it using Hadoop in local mode. Your final outputs will help the
company understand which movies are popular, how viewers rate them, and what
keywords often appear in the reviews.
</p>

<h2>About the Dataset</h2>

<p>
We will use the <b>Amazon Movies &amp; TV 5-core dataset</b>, which is publicly
available and contains movie reviews from Amazon. Each entry in the dataset
is stored as a JSON object with fields such as:
</p>

<ul>
  <li><code>reviewerID</code> – the ID of the reviewer</li>
  <li><code>asin</code> – unique movie identifier</li>
  <li><code>reviewText</code> – full written review</li>
  <li><code>overall</code> – the star rating (1 to 5)</li>
  <li><code>vote</code> – how many users found the review helpful</li>
  <li><code>category</code> – always “Movies &amp; TV” in this dataset</li>
</ul>

<p>
You will download the dataset and inspect a few records to understand its
structure before starting the tasks.
</p>
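<p>
As a rough illustration of what inspecting a record looks like, the sketch below
parses one JSON line and pulls out the fields used in the tasks. The sample
record and the helper name <code>summarize</code> are made up for this example;
real records may lack some fields, which is why <code>.get</code> is used.
</p>

```python
import json

# Hypothetical record shaped like the fields listed above; the values are invented.
sample_line = (
    '{"reviewerID": "A_EXAMPLE", "asin": "0000000000", '
    '"reviewText": "Great movie, loved it.", "overall": 5.0}'
)

def summarize(line):
    """Return (asin, rating, number of words in the review) for one JSON line."""
    record = json.loads(line)
    asin = record.get("asin")              # None if the field is missing
    rating = record.get("overall")         # None if the field is missing
    text = record.get("reviewText", "")    # empty string if there is no review text
    return asin, rating, len(text.split())

print(summarize(sample_line))  # ('0000000000', 5.0, 4)
```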

In [1]:
import gzip
import json
import os
import sys # Import sys for printing warnings to stderr

# -------------------------------------------------------------------
# 1) Download the SMALL Movies & TV dataset (correct version)
# -------------------------------------------------------------------
print("Downloading SMALL Movies & TV 5-core dataset...")

URL = "https://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Movies_and_TV_5.json.gz"
FILE_GZ = "Movies_and_TV_small.json.gz"

!wget --no-check-certificate -O {FILE_GZ} {URL}

if os.path.getsize(FILE_GZ) == 0:
    raise ValueError("Downloaded file is empty!")

print("Download complete.\n")

# -------------------------------------------------------------------
# 2) Load JSON data (each line is a JSON object)
# -------------------------------------------------------------------
print("Loading JSON data from JSON Lines format...")

data = []
with gzip.open(FILE_GZ, "rt", encoding="utf-8") as f:
    for line_num, line in enumerate(f, 1):
        line = line.strip()
        if line: # Only process non-empty lines
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Warning: Could not decode JSON on line {line_num}: {line}. Error: {e}", file=sys.stderr)
                # Continue to the next line to be robust against malformed lines
                continue

print(f"Total records loaded: {len(data)}") # Should be ~3.4 million records
print()

# -------------------------------------------------------------------
# 3) Convert to JSON-LINES format for MapReduce (if not already done)
#    This step ensures 'movies.json' is a clean JSON-Lines file.
# -------------------------------------------------------------------
print("Converting to JSON-lines format (outputting to movies.json with 900,000 records)...")

# Limit to 900,000 records to shorten running time on Colab (on a real cluster, remove this limit)
limited_data = data[:900000]

with open("movies.json", "w", encoding="utf-8") as out:
    for entry in limited_data:
        out.write(json.dumps(entry) + "\n")

print(f"Conversion complete. Saved as movies.json with {len(limited_data)} records\n")

# -------------------------------------------------------------------
# 4) Preview
# -------------------------------------------------------------------
print("Sample entries:\n")

with open("movies.json", "r", encoding="utf-8") as f:
    for i in range(3):
        line = f.readline()
        if not line: # Check for end of file
            print("Not enough lines in movies.json to display 3 samples.")
            break
        print(json.loads(line))

Downloading SMALL Movies & TV 5-core dataset...
--2025-12-10 08:19:50--  https://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Movies_and_TV_5.json.gz
Resolving jmcauley.ucsd.edu (jmcauley.ucsd.edu)... 137.110.160.73
Connecting to jmcauley.ucsd.edu (jmcauley.ucsd.edu)|137.110.160.73|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 791322468 (755M) [application/x-gzip]
Saving to: ‘Movies_and_TV_small.json.gz’


2025-12-10 08:20:07 (47.3 MB/s) - ‘Movies_and_TV_small.json.gz’ saved [791322468/791322468]

Download complete.

Loading JSON data from JSON Lines format...
Total records loaded: 3410019

Converting to JSON-lines format (outputting to movies.json with 900,000 records)...
Conversion complete. Saved as movies.json with 900000 records

Sample entries:

{'overall': 5.0, 'verified': True, 'reviewTime': '11 9, 2012', 'reviewerID': 'A2M1CU2IRZG0K9', 'asin': '0005089549', 'style': {'Format:': ' VHS Tape'}, 

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
!tar -xzf hadoop-3.3.6.tar.gz

E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/o/openjdk-8/openjdk-8-jre-headless_8u462-ga%7eus1-0ubuntu2%7e22.04.2_amd64.deb  404  Not Found [IP: 185.125.190.82 80]
E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/o/openjdk-8/openjdk-8-jdk-headless_8u462-ga%7eus1-0ubuntu2%7e22.04.2_amd64.deb  404  Not Found [IP: 185.125.190.82 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?



In [24]:
import os
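# Note: the openjdk-8 install above failed with a 404, so JAVA_HOME points at
# Java 17 instead (assumed to be preinstalled in this Colab image).
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"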
os.environ["HADOOP_HOME"] = "/content/hadoop-3.3.6"
os.environ["PATH"] += f":{os.environ['HADOOP_HOME']}/bin:{os.environ['HADOOP_HOME']}/sbin"

In [20]:
%%bash
cat > /content/hadoop-3.3.6/etc/hadoop/core-site.xml << EOF
<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
 </property>
</configuration>
EOF

<h2>Task 1 — Count Total Number of Reviews per Movie</h2>

<p>
Your first task is to count how many reviews each movie has received. You will
write a MapReduce program where:
</p>

<ul>
  <li>The <b>mapper</b> reads each JSON record, extracts the <code>asin</code>
      field, and emits <code>(asin, 1)</code>.</li>
  <li>The <b>reducer</b> sums the counts for each movie and outputs
      <code>(asin, total_reviews)</code>.</li>
</ul>

<p>
This task is conceptually similar to a word count, but applied to movie IDs.
Complete the mapper and reducer code in the following cell.
</p>
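<p>
Before submitting a job, it can help to dry-run the logic locally. The sketch
below imitates the map, sort, and reduce phases in plain Python on a few
made-up records; the function names <code>map_counts</code> and
<code>reduce_counts</code> are illustrative and not part of Hadoop.
</p>

```python
import json
from itertools import groupby

def map_counts(lines):
    """Mapper step: yield (asin, 1) for every parseable record that has an asin."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        asin = record.get("asin")
        if asin:
            yield asin, 1

def reduce_counts(pairs):
    """Reducer step: expects pairs sorted by key, as Hadoop delivers them."""
    for asin, group in groupby(pairs, key=lambda kv: kv[0]):
        yield asin, sum(value for _, value in group)

# Made-up input lines, for illustration only.
lines = ['{"asin": "B"}', '{"asin": "A"}', 'not json', '{"asin": "B"}']
shuffled = sorted(map_counts(lines))  # stands in for Hadoop's shuffle and sort
print(list(reduce_counts(shuffled)))  # [('A', 1), ('B', 2)]
```

<p>
Summing the emitted values, rather than counting lines, keeps the reducer
correct even if the values are partial counts instead of 1.
</p>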

In [5]:
# Write your Mapper and Reducer code for Task 1 here.
# You may use Python Hadoop Streaming or Java MapReduce.

In [21]:
%%writefile mapper1.py
#!/usr/bin/env python3
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    try:
        fichier = json.loads(line)
        asin = fichier.get("asin", None)
        if asin:
            print(f"{asin}\t1")
    except json.JSONDecodeError:
        continue

Overwriting mapper1.py


In [22]:
%%writefile reducer1.py
#!/usr/bin/env python3
import sys

current_asin = None
count = 0

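for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    asin, value = line.split("\t", 1)
    value = int(value)

    if asin != current_asin:
        # Key changed: flush the total for the previous movie
        if current_asin is not None:
            print(f"{current_asin}\t{count}")
        current_asin = asin
        count = value
    else:
        # Add the emitted value so partial counts are also handled correctly
        count += value

# Flush the last movie
if current_asin is not None:
    print(f"{current_asin}\t{count}")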

Overwriting reducer1.py


In [25]:
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input movies.json \
  -output output1 \
  -mapper mapper1.py \
  -reducer reducer1.py \
  -file mapper1.py \
  -file reducer1.py

2025-12-10 08:23:37,791 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper1.py, reducer1.py] [] /tmp/streamjob17249907201381693551.jar tmpDir=null
2025-12-10 08:23:39,183 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-12-10 08:23:39,470 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-12-10 08:23:39,471 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-12-10 08:23:39,506 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-12-10 08:23:39,835 INFO mapred.FileInputFormat: Total input files to process : 1
2025-12-10 08:23:39,872 INFO mapreduce.JobSubmitter: number of splits:24
2025-12-10 08:23:40,365 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1242610421_0001
2025-12-10 08:23:40,366 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-12-10 08:23:40,843 INFO mapred.LocalDistributedC

<h2>Task 2 — Compute Average Rating per Movie</h2>

<p>
In this task, you will compute the <b>average rating</b> for each movie.
</p>

<p>The mapper should:</p>
<ul>
  <li>Extract <code>asin</code> and <code>overall</code> (rating)</li>
  <li>Emit <code>(asin, rating)</code></li>
</ul>

<p>The reducer should:</p>
<ul>
  <li>Sum all ratings for each movie</li>
  <li>Count how many ratings were received</li>
  <li>Compute and output the average rating</li>
</ul>

<p>
Use a MapReduce job to generate a list of movies with their average ratings.
</p>
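<p>
The reducer logic described above can be sketched in plain Python as follows.
This is only an illustration on invented ratings; <code>reduce_average</code>
is a hypothetical helper, and the input is assumed to be sorted by
<code>asin</code>, as it would be after Hadoop's shuffle.
</p>

```python
from itertools import groupby

def reduce_average(pairs):
    """pairs: (asin, rating) tuples sorted by asin; yields (asin, mean rating)."""
    for asin, group in groupby(pairs, key=lambda kv: kv[0]):
        total = 0.0
        count = 0
        for _, rating in group:
            total += rating   # running sum of ratings for this movie
            count += 1        # number of ratings seen for this movie
        yield asin, total / count

# Invented ratings, for illustration only.
pairs = sorted([("A", 5.0), ("B", 2.0), ("A", 4.0), ("B", 3.0), ("A", 3.0)])
print(list(reduce_average(pairs)))  # [('A', 4.0), ('B', 2.5)]
```

<p>
If a combiner were added later, it would need to pass along (sum, count) pairs
rather than partial averages: for ratings 5, 4 and 3, averaging the partial
means 4.5 and 3 gives 3.75 instead of the correct 4.0.
</p>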

In [9]:
# Write your Mapper and Reducer code for Task 2 here.
# You may use Python Hadoop Streaming or Java MapReduce.

In [10]:
%%writefile mapper2.py
#!/usr/bin/env python3
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    try:
        fichier = json.loads(line)
        asin = fichier.get("asin")
        rating = fichier.get("overall")

        if asin is not None and rating is not None:
            print(f"{asin}\t{rating}")
    except json.JSONDecodeError:
        continue

Writing mapper2.py


In [11]:
%%writefile reducer2.py
#!/usr/bin/env python3
import sys

current_asin = None
count = 0
rate = 0.0

for line in sys.stdin:
    asin, value = line.strip().split("\t")
    value = float(value)

    if asin != current_asin:
        if current_asin is not None:
            print(f"{current_asin}\t{rate/count}")

        current_asin = asin
        count = 1
        rate = value
    else:
        count += 1
        rate += value

if current_asin is not None:
    print(f"{current_asin}\t{rate/count}")

Writing reducer2.py


In [26]:
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input movies.json \
  -output output2 \
  -mapper mapper2.py \
  -reducer reducer2.py \
  -file mapper2.py \
  -file reducer2.py

2025-12-10 08:24:07,730 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper2.py, reducer2.py] [] /tmp/streamjob280738519433541948.jar tmpDir=null
2025-12-10 08:24:08,606 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-12-10 08:24:08,780 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-12-10 08:24:08,781 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-12-10 08:24:08,806 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-12-10 08:24:09,094 INFO mapred.FileInputFormat: Total input files to process : 1
2025-12-10 08:24:09,139 INFO mapreduce.JobSubmitter: number of splits:24
2025-12-10 08:24:09,501 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local394460440_0001
2025-12-10 08:24:09,501 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-12-10 08:24:09,857 INFO mapred.LocalDistributedCach

<h2>Task 3 — Extract Frequent Keywords from Reviews</h2>

<p>
Now you will perform text analysis on the <code>reviewText</code> field.
Your task is to extract meaningful keywords for each movie.
</p>

<p>The mapper should:</p>
<ul>
  <li>Clean and tokenize the text</li>
  <li>Remove punctuation and stopwords</li>
  <li>Emit <code>(asin:word, 1)</code> for each keyword</li>
</ul>

<p>The reducer should:</p>
<ul>
  <li>Sum the counts for each <code>(asin, word)</code> pair</li>
  <li>Output the total frequency of each keyword per movie</li>
</ul>

<p>
This task combines text preprocessing with distributed computation.
</p>
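<p>
A minimal sketch of the cleaning and tokenizing step, assuming a simple
regex-based approach and a deliberately tiny stopword list. The helper name
<code>keywords</code> and the sample sentence are invented for illustration.
</p>

```python
import re

# Tiny illustrative stopword list; a realistic one would be much longer.
STOPWORDS = {"the", "and", "was", "this", "with"}

def keywords(text, min_len=3):
    """Lowercase, replace punctuation with spaces, then drop short words and stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep only letters, digits, whitespace
    return [w for w in text.split() if len(w) >= min_len and w not in STOPWORDS]

print(keywords("The acting was GREAT, and the plot twist... wow!"))
# ['acting', 'great', 'plot', 'twist', 'wow']
```

<p>
Replacing punctuation with a space (rather than deleting it) avoids gluing
neighbouring words together, at the cost of splitting contractions such as
"it's" into short fragments that the length filter then discards.
</p>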

In [13]:
# Write your Mapper and Reducer code for Task 3 here.

In [14]:
%%writefile mapper3.py
#!/usr/bin/env python3
import json
import re
import sys

STOPWORDS = {
    "i","you","he","she","it","we","they","me","him","her","us","them","my",
    "your","his","her","its","our","their","mine","yours","hers","ours",
    "theirs","the","a","an","and","or","but","so","yet","nor","for","in","on",
    "at","by","for","from","to","of","with","without","into","onto","over",
    "very","too","just","only","also","again","still","even","ever","never",
    "be","am","is","are","was","were","been","being","have","has","had",
    "having","do","does","did","doing","can","could","shall","should","will",
    "would","may","might","must","get","gets","got","getting","make","makes",
    "made","making","go","goes","went","gone","going","see","sees","saw","seen",
    "seeing","know","knows","knew","known","knowing","this","that","these",
    "those","there","here","where","when","why","how","what","which","who",
    "whom","whose","such","many","much","more","most","some","any","all","each",
    "both","either","neither","because","though","although","while","unless",
    "since","until","before","after","about",
}

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    try:
        data = json.loads(line)
        asin = data.get("asin")
        text = data.get("reviewText", "")

        if not asin or not text:
            continue

        # lowercase
        text = text.lower()

        # remove punctuation using regex
        text = re.sub(r"[^a-z0-9\s]", " ", text)

        # tokenize
        words = text.split()

        for word in words:
            if len(word) > 2 and word not in STOPWORDS:
                print(f"{asin}:{word}\t1")

    except json.JSONDecodeError:
        continue

Writing mapper3.py


In [15]:
%%writefile reducer3.py
#!/usr/bin/env python3
import sys

current_key = None
count = 0

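for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key, value = line.split("\t", 1)
    value = int(value)

    if key != current_key:
        # Key changed: flush the total for the previous asin:word pair
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key = key
        count = value
    else:
        # Add the emitted value so partial counts are also handled correctly
        count += value

# Flush the last key
if current_key is not None:
    print(f"{current_key}\t{count}")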

Writing reducer3.py


In [27]:
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input movies.json \
  -output output3 \
  -mapper mapper3.py \
  -reducer reducer3.py \
  -file mapper3.py \
  -file reducer3.py

2025-12-10 08:24:38,394 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper3.py, reducer3.py] [] /tmp/streamjob10058220352667063537.jar tmpDir=null
2025-12-10 08:24:39,265 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-12-10 08:24:39,419 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-12-10 08:24:39,419 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-12-10 08:24:39,443 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-12-10 08:24:39,672 INFO mapred.FileInputFormat: Total input files to process : 1
2025-12-10 08:24:39,701 INFO mapreduce.JobSubmitter: number of splits:24
2025-12-10 08:24:40,125 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local52683935_0001
2025-12-10 08:24:40,126 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-12-10 08:24:40,502 INFO mapred.LocalDistributedCac

<h2>Task 4 — Join Ratings with Top Keywords</h2>

<p>
For this task, you will combine the results of Task 2 (average ratings) and
Task 3 (keyword frequencies) using a <b>reduce-side join</b>.
</p>

<p>
You will provide two inputs to your MapReduce job:
</p>

<ul>
  <li><b>Ratings file</b> with <code>(asin, average_rating)</code></li>
  <li><b>Keywords file</b> with <code>(asin:keyword, count)</code>, as produced in Task 3</li>
</ul>

<p>The mapper should key every record by <code>asin</code> and tag the value with its source:</p>

<ul>
  <li><code>("R", rating)</code> for ratings</li>
  <li><code>("K", keyword:count)</code> for keywords</li>
</ul>

<p>
The reducer will receive all entries for a given movie and combine them to
produce an output containing:
</p>

<ul>
  <li>The movie identifier (<code>asin</code>)</li>
  <li>Its average rating</li>
  <li>Its most frequent keywords</li>
</ul>
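<p>
The reduce-side join described above can be sketched in plain Python as
follows. The records are invented, <code>join_movie</code> is a hypothetical
helper, and breaking ties alphabetically is an assumption of this sketch
rather than a requirement of the task.
</p>

```python
from itertools import groupby

def join_movie(records, top_n=3):
    """records: (asin, tag, value) tuples sorted by asin, as after the shuffle."""
    for asin, group in groupby(records, key=lambda r: r[0]):
        rating = ""   # stays empty if no "R" record arrives for this movie
        counts = {}   # keyword -> frequency
        for _, tag, value in group:
            if tag == "R":
                rating = value
            elif tag == "K":
                word, count = value.rsplit(":", 1)
                counts[word] = int(count)
        # Highest count first; alphabetical order breaks ties deterministically
        top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        yield asin, rating, ",".join(word for word, _ in top)

# Invented tagged records, for illustration only.
records = sorted([
    ("A", "K", "great:7"),
    ("A", "R", "4.5"),
    ("A", "K", "plot:2"),
    ("A", "K", "funny:7"),
    ("A", "K", "slow:1"),
    ("B", "R", "3.0"),
])
print(list(join_movie(records)))
# [('A', '4.5', 'funny,great,plot'), ('B', '3.0', '')]
```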

In [29]:
%%writefile mapper4.py
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    # Both inputs are tab-separated (key, value) lines
    parts = line.split("\t")
    if len(parts) != 2:
        continue
    key, value = parts

    if ":" in key:
        # Line from output3: "asin:keyword<TAB>count"
        asin, keyword = key.split(":", 1)
        print(f"{asin}\tK\t{keyword}:{value}")
    else:
        # Line from output2: "asin<TAB>average_rating"
        print(f"{key}\tR\t{value}")

Overwriting mapper4.py


In [30]:
%%writefile reducer4.py
#!/usr/bin/env python3
import sys

current_asin = None
rating = None
keywords = {}  # keyword -> count

def movie(asin, rating, keywords):
    if asin is None:
        return
    if rating is None:
        rating=''
    if keywords:
        top3 = sorted(keywords.items(), key=lambda x: -x[1])[:3]
        top_keywords = ",".join([k for k, c in top3])
    else:
        top_keywords = ""
    print(f"{asin}\t{rating}\t{top_keywords}")

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    asin, tag, value = line.split("\t")

    # movie changed: flush the previous movie
    if current_asin is not None and asin != current_asin:
        movie(current_asin, rating, keywords)
        rating = None
        keywords = {}

    current_asin = asin

    if tag == "R":
        rating = value
    elif tag == "K":
        keyword, count = value.split(":",1)
        keywords[keyword] = int(count)

# last movie
if current_asin is not None:
    movie(current_asin, rating, keywords)

Overwriting reducer4.py


In [32]:
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input output2 \
  -input output3 \
  -output output5 \
  -mapper mapper4.py \
  -reducer reducer4.py \
  -file mapper4.py \
  -file reducer4.py

2025-12-10 08:48:13,954 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper4.py, reducer4.py] [] /tmp/streamjob11273470800148529793.jar tmpDir=null
2025-12-10 08:48:14,740 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-12-10 08:48:14,924 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-12-10 08:48:14,924 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-12-10 08:48:14,952 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-12-10 08:48:15,323 INFO mapred.FileInputFormat: Total input files to process : 2
2025-12-10 08:48:15,366 INFO mapreduce.JobSubmitter: number of splits:11
2025-12-10 08:48:15,894 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local818402201_0001
2025-12-10 08:48:15,895 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-12-10 08:48:16,488 INFO mapred.LocalDistributedCa

In [19]:
# Write your Mapper and Reducer code for Task 4 here.