# Big Data Analytics — Assignment 01
> Author : Badr TAJINI - Big Data Analytics - ESIEE 2025-2026

**Chapter 1:** Introduction to Big Data  
**Chapter 2:** MapReduce Algorithm Design

**Tools:** Spark or PySpark.  
**Advice:** Keep evidence of every run and make the results reproducible.

## 0. Bootstrap
Use Profile A from `BDA_Installation_Guide.md`. Log the versions and the key Spark configs.

In [1]:
# write some code here
# - create SparkSession('BDA-A01') with UTC timezone
# - print Spark/PySpark/Python versions
# - set spark.sql.shuffle.partitions small for local runs

from datetime import datetime
from pyspark.sql import SparkSession
import pyspark, sys, platform, os

print("Run timestamp (UTC):", datetime.utcnow().isoformat())


# 1. Create the Spark session
spark = (
    SparkSession.builder
    .appName("BDA-A01")
    .config("spark.sql.session.timeZone","UTC")
    .config("spark.sql.shuffle.partitions","8")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")

# 2. Print the versions for ENV.md
print("=== ENVIRONMENT INFO ===")
print(f"Spark version: {spark.version}")
print(f"PySpark version: {pyspark.__version__}")
print("Python version:", sys.version.split()[0], "|", platform.platform())
print(f"Session timezone: {spark.conf.get('spark.sql.session.timeZone')}")


Run timestamp (UTC): 2025-11-23T13:52:56.224428


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/23 14:52:59 WARN Utils: Your hostname, PCPORTABLEAUR, resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/11/23 14:52:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/23 14:53:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


=== ENVIRONMENT INFO ===
Spark version: 4.0.1
PySpark version: 4.0.1
Python version: 3.10.19 | Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Session timezone: UTC


## 1. Load dataset

In [2]:
# write some code here
# - ensure data/shakespeare.txt exists (download if missing)
# - create an RDD of lines and a DataFrame with column 'line'
# - show a few lines

import os
from pathlib import Path
import urllib.request
from pyspark.sql import Row

BASE_DIR = Path.cwd()
DATA_DIR = BASE_DIR / "data"
OUTPUTS_DIR = BASE_DIR / "outputs"
PROOF_DIR = BASE_DIR / "proof"
for directory in (DATA_DIR, OUTPUTS_DIR, PROOF_DIR):
    directory.mkdir(exist_ok=True)


# 1. Check that the file exists; otherwise ask the user to download it
# data_path = "data/shakespeare.txt"
TEXT_PATH = DATA_DIR / "shakespeare.txt"

if not os.path.exists(TEXT_PATH):
    print(f" File {TEXT_PATH} not found.")
    print(" Please download it from the course website and place it in the 'data' folder.")
else:
    print(f" File found: {TEXT_PATH}")

# 2. Create an RDD of lines
raw_rdd = spark.sparkContext.textFile(str(TEXT_PATH)).cache()

# Sanity check (this count also materializes the RDD cache)
print(f"Total number of lines: {raw_rdd.count()}")

# 3. Create a DataFrame with a single 'line' column, read directly from the file
lines_df = spark.read.text(str(TEXT_PATH)).withColumnRenamed("value", "line").cache()


# Materialize caches
raw_rdd.count()
lines_df.count()

# 4. Show the first 10 lines
lines_df.show(10, truncate=False)



 File found: /home/aurel/bda_labs/bda_assignment01/data/shakespeare.txt


Total number of lines: 122458
+--------------------------------------------+
|line                                        |
+--------------------------------------------+
|1609                                        |
|                                            |
|THE SONNETS                                 |
|                                            |
|by William Shakespeare                      |
|                                            |
|                                            |
|                                            |
|                     1                      |
|  From fairest creatures we desire increase,|
+--------------------------------------------+
only showing top 10 rows
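
The task list for this cell asks for a download when the file is missing, whereas the cell above only prints a message. A minimal sketch of a download-if-missing helper follows. The name `ensure_file` is hypothetical, and the source URL is not given in this notebook, so it is left as a parameter to be filled in by the reader.

```python
import urllib.request
from pathlib import Path


def ensure_file(path, url):
    """Download `url` to `path` unless a non-empty file is already there.

    Returns True when a download happened, False when the file existed.
    `url` is an assumption: the notebook does not state where the corpus lives.
    """
    path = Path(path)
    if path.exists() and path.stat().st_size > 0:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".part")
    # urlretrieve accepts http(s):// as well as file:// URLs
    urllib.request.urlretrieve(url, str(tmp))
    # Move into place only after the transfer finished, so a failed
    # download never leaves a truncated file under the final name
    tmp.replace(path)
    return True
```

In the notebook this could be called as `ensure_file(TEXT_PATH, url)` before building `raw_rdd`, with `url` set to wherever the course distributes `shakespeare.txt`.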


## 2. Part A — “perfect x” follower counts

In [3]:
# write some code here
# - tokenize lowercase, split on non-letters
# - for each line, if tokens[i]=='perfect' take tokens[i+1]
# - discard followers with count=1
# - write outputs/perfect_followers.csv
# - save explain('formatted') to proof/plan_perfect.txt

# ============================================================
# Part A — "perfect x" follower counts
# ============================================================

import re
from contextlib import redirect_stdout
from io import StringIO
from pyspark.sql import functions as F
import pandas as pd

pattern_perfect = re.compile(r"[a-z]+")

# Extract the words that immediately follow "perfect"
def followers(tokens):
    result = []
    for idx in range(len(tokens) - 1):
        if tokens[idx] == "perfect":
            follower = tokens[idx + 1]
            if follower:
                result.append(follower)
    return result

followers_rdd = (
    lines_df.rdd
    .map(lambda row: [token for token in pattern_perfect.findall(row.line.lower()) if token])
    .flatMap(followers)
)

followers_df = followers_rdd.map(lambda token: (token,)).toDF(["follower"])

# Count the occurrences of each follower
perfect_counts_df = (
    followers_df
    .groupBy("follower")
    .count()
    .filter(F.col("count") > 1)
    .orderBy(F.desc("count"), F.asc("follower"))
)


# Show the results
perfect_counts_df.show(truncate=False)


# Save the results and the execution plan
perfect_counts_df.toPandas().to_csv(OUTPUTS_DIR / "perfect_followers.csv", index=False)

plan_buffer = StringIO()
with redirect_stdout(plan_buffer):
    perfect_counts_df.explain("formatted")
(PROOF_DIR / "plan_perfect.txt").write_text(plan_buffer.getvalue())


                                                                                

+--------+-----+
|follower|count|
+--------+-----+
|in      |4    |
|love    |4    |
|honour  |2    |
|that    |2    |
|yellow  |2    |
+--------+-----+



                                                                                

3156
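
The follower rule used in this part can be checked without Spark on a few lines. The sketch below applies the same `[a-z]+` tokenization and the same "count > 1" filter; the helper name `followers_of` and the sample lines are hypothetical and not taken from the corpus.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")  # same token rule as the cell above


def followers_of(target, line):
    """Return the tokens that immediately follow `target` within one line."""
    tokens = TOKEN_RE.findall(line.lower())
    # zip pairs each token with its successor; the last token has none
    return [nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target]


# Hypothetical sample lines, not taken from the corpus
sample = [
    "A perfect love, a perfect day.",
    "Perfect-love is rare; imperfect love is common.",
    "She is perfect",  # 'perfect' ends the line, so it has no follower
]

counts = Counter(f for line in sample for f in followers_of("perfect", line))
kept = {word: n for word, n in counts.items() if n > 1}  # drop count == 1
print(kept)  # prints {'love': 2}
```

Two consequences of this design are worth keeping in mind: a follower is never taken from the next line, and splitting on non-letters breaks contractions, so a form such as `perfect'st` would contribute the follower `st`.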

## 3. Part B — PMI with RDDs: pairs

In [4]:
# write some code here
# - parse --threshold K
# - keep first 40 tokens per line
# - compute counts for x and (x,y); then PMI=log10(P(x,y)/(P(x)P(y)))
# - filter by threshold; write outputs/pmi_pairs_sample.csv
# - save plan text to proof/plan_pmi_pairs.txt if DF used


import math
from itertools import combinations
from io import StringIO
from contextlib import redirect_stdout

MAX_TOKENS = 40
PMI_THRESHOLD = 5  # K: minimum co-occurrence count for a pair to be kept (not a cutoff on the PMI value)

def dedupe_preserve(tokens):
    seen = set()
    ordered = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            ordered.append(token)
    return ordered

pmi_token_pattern = re.compile(r"[a-z]+")

tokens_per_line = (
    lines_df.rdd
    # Tokenize, then keep only the first MAX_TOKENS tokens of each line
    .map(lambda row: pmi_token_pattern.findall(row.line.lower())[:MAX_TOKENS])
    # Count each word at most once per line
    .map(dedupe_preserve)
    # A line needs at least two distinct tokens to produce a pair
    .filter(lambda tokens: len(tokens) > 1)
    .cache()
)

num_docs = tokens_per_line.count()

from operator import add

marginal_counts = (
    tokens_per_line
    .flatMap(lambda tokens: ((token, 1) for token in tokens))
    .reduceByKey(add)
)

marginal_dict = dict(marginal_counts.collect())
marginal_bc = spark.sparkContext.broadcast(marginal_dict)

pair_counts = (
    tokens_per_line
    .flatMap(lambda tokens: [((min(a, b), max(a, b)), 1) for a, b in combinations(tokens, 2)])
    .reduceByKey(add)
    .filter(lambda kv: kv[1] >= PMI_THRESHOLD)
)

def compute_pair_pmi(kv):
    (x, y), co_count = kv
    count_x = marginal_bc.value.get(x)
    count_y = marginal_bc.value.get(y)
    if not count_x or not count_y:
        return None
    pmi = math.log10((co_count * num_docs) / (count_x * count_y))
    return (x, y, float(pmi), int(co_count))

pmi_pairs_rdd = pair_counts.map(compute_pair_pmi).filter(lambda row: row is not None)

pairs_df = spark.createDataFrame(pmi_pairs_rdd, schema=["x", "y", "pmi", "count"]).orderBy(F.desc("pmi"))

pairs_df.show(10, truncate=False)

pairs_df.toPandas().to_csv(OUTPUTS_DIR / "pmi_pairs_sample.csv", index=False)

plan_buffer = StringIO()
with redirect_stdout(plan_buffer):
    pairs_df.explain("formatted")
(PROOF_DIR / "plan_pmi_pairs.txt").write_text(plan_buffer.getvalue())



                                                                                

+---------+---------+------------------+-----+
|x        |y        |pmi               |count|
+---------+---------+------------------+-----+
|mell     |pell     |4.267109122884396 |5    |
|sauf     |votre    |4.091017863828714 |5    |
|jourdain |margery  |4.011836617781089 |5    |
|cine     |med      |4.003867688109814 |7    |
|phrynia  |timandra |3.9660791272204143|5    |
|dogberry |verges   |3.920321636659739 |6    |
|dit      |il       |3.789987868164733 |5    |
|clitus   |dardanius|3.753004301911563 |5    |
|envoy    |l        |3.72304107853412  |14   |
|cleomenes|dion     |3.644845745468146 |7    |
+---------+---------+------------------+-----+
only showing top 10 rows


                                                                                

1466
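
The expression inside `compute_pair_pmi` follows from estimating the probabilities with line-level counts. The notation below is introduced here for illustration: $N$ stands for `num_docs`, $c(x)$ for the number of kept lines containing $x$, and $c(x,y)$ for the number of kept lines containing both words.

```latex
\begin{aligned}
P(x) &= \frac{c(x)}{N}, \qquad P(y) = \frac{c(y)}{N}, \qquad P(x,y) = \frac{c(x,y)}{N},\\[4pt]
\mathrm{PMI}(x,y) &= \log_{10}\frac{P(x,y)}{P(x)\,P(y)}
                   = \log_{10}\frac{c(x,y)\,N}{c(x)\,c(y)}.
\end{aligned}
```

Because `tokens_per_line` keeps only lines with at least two distinct tokens, $N$ and the marginal counts are taken over those lines only. The threshold is applied to $c(x,y)$ before the logarithm is computed.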

## 4. Part B — PMI with RDDs: stripes

In [5]:
# write some code here
# - build stripes x -> map[y -> count] with combiners
# - reuse univariate counts; compute PMI with log10
# - threshold K; write outputs/pmi_stripes_sample.csv
# - plan to proof/plan_pmi_stripes.txt if DF used

from collections import Counter
from contextlib import redirect_stdout
from io import StringIO


def stripe_builder(tokens):
    for x in tokens:
        counter = Counter()
        for y in tokens:
            if y != x:
                counter[y] += 1
        if counter:
            yield (x, counter)

def merge_counters(c1, c2):
    c1.update(c2)
    return c1

def stripe_to_rows(item):
    x, counter = item
    count_x = marginal_bc.value.get(x)
    if not count_x:
        return []
    rows = []
    for y, co_count in counter.items():
        if co_count >= PMI_THRESHOLD:
            count_y = marginal_bc.value.get(y)
            if not count_y:
                continue
            pmi = math.log10((co_count * num_docs) / (count_x * count_y))
            rows.append((x, y, float(pmi), int(co_count)))
    return rows

stripes_counts = (
    tokens_per_line
    .flatMap(stripe_builder)
    .reduceByKey(merge_counters)
)

pmi_stripes_rdd = stripes_counts.flatMap(stripe_to_rows)

stripes_df = spark.createDataFrame(pmi_stripes_rdd, schema=["x", "y", "pmi", "count"]).orderBy(F.desc("pmi"))

stripes_df.show(10, truncate=False)

stripes_df.toPandas().to_csv(OUTPUTS_DIR / "pmi_stripes_sample.csv", index=False)

plan_buffer = StringIO()
with redirect_stdout(plan_buffer):
    stripes_df.explain("formatted")
(PROOF_DIR / "plan_pmi_stripes.txt").write_text(plan_buffer.getvalue())
    


                                                                                

+--------+--------+------------------+-----+
|x       |y       |pmi               |count|
+--------+--------+------------------+-----+
|pell    |mell    |4.267109122884396 |5    |
|mell    |pell    |4.267109122884396 |5    |
|sauf    |votre   |4.091017863828714 |5    |
|votre   |sauf    |4.091017863828714 |5    |
|margery |jourdain|4.011836617781089 |5    |
|jourdain|margery |4.011836617781089 |5    |
|med     |cine    |4.003867688109814 |7    |
|cine    |med     |4.003867688109814 |7    |
|timandra|phrynia |3.9660791272204143|5    |
|phrynia |timandra|3.9660791272204143|5    |
+--------+--------+------------------+-----+
only showing top 10 rows


                                                                                

1510
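
The stripes output above lists every pair twice, once as (x, y) and once as (y, x), while the pairs output lists it once in sorted order. The sketch below reproduces both counting schemes in plain Python on a hypothetical toy corpus and shows that keeping only `x < y` from the flattened stripes recovers the pairs counts. It illustrates the two designs and is not the Spark implementation itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each inner list holds one line's distinct tokens
docs = [
    ["a", "b", "c"],
    ["a", "b"],
    ["b", "c"],
]

# Pairs: one record per unordered pair, keyed in canonical (sorted) order
pairs = Counter()
for tokens in docs:
    for x, y in combinations(tokens, 2):
        pairs[(min(x, y), max(x, y))] += 1

# Stripes: one Counter per left-hand token, covering every co-occurring token
stripes = {}
for tokens in docs:
    for x in tokens:
        stripe = stripes.setdefault(x, Counter())
        stripe.update(y for y in tokens if y != x)

# Flattening the stripes yields both (x, y) and (y, x) with the same count
flat = {(x, y): n for x, stripe in stripes.items() for y, n in stripe.items()}

# Keeping only x < y recovers exactly the pairs counts
canonical = {key: n for key, n in flat.items() if key[0] < key[1]}

print(len(pairs), len(flat))       # prints 3 6
print(canonical == dict(pairs))    # prints True
```

If the two CSV files are meant to be compared row by row, one option is to apply the same `x < y` filter to the stripes rows before writing them.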

## 5. Spark UI evidence
Open the Spark UI at http://localhost:4040 while the jobs run and capture Files Read, Input Size, and Shuffle Read/Write.

## 6. Environment and reproducibility

In [6]:
# write some code here
# - print Java version, Spark conf, OS info
# - save ENV.md: versions + key configs

import json
import subprocess

def get_java_version():
    try:
        output = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
        return output.decode("utf-8").strip().splitlines()[0]
    except Exception as exc:
        return f"Unavailable ({exc})"

java_output = get_java_version()
print(f"Java: {java_output}")

print("Spark configuration (selected):")
conf_items = sorted(spark.sparkContext.getConf().getAll())
for key, value in conf_items:
    print(f" - {key} = {value}")

env_summary = {
    "python": sys.version,
    "spark": spark.version,
    "pyspark": pyspark.__version__,
    "java": java_output,
    "os": platform.platform(),
    "spark_conf": {k: v for k, v in conf_items if k.startswith("spark.")}
}

env_lines = [
    "# Environment Summary",
    "",
    f"- Python: {sys.version.split()[0]}",
    f"- Spark: {spark.version}",
    f"- PySpark: {pyspark.__version__}",
    f"- Java: {java_output}",
    f"- OS: {platform.platform()}",
    "",
    "## Spark Configuration"
]

env_lines.extend(f"- {k} = {v}" for k, v in env_summary["spark_conf"].items())

ENV_PATH = Path("ENV.md")
ENV_PATH.write_text("\n".join(env_lines) + "\n")

print(f"Environment details saved to {ENV_PATH.resolve()}")



Java: openjdk version "21.0.6-internal" 2025-01-21
Spark configuration (selected):
 - spark.app.id = local-1763018437583
 - spark.app.name = BDA-A01
 - spark.app.startTime = 1763018436868
 - spark.app.submitTime = 1763018436561
 - spark.driver.extraJavaOptions = -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-modules=jdk.incubator.vector --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED -