# Big Data Course Slides - Updated for 2025

## Slide 1: Introduction
**Original Content**: Simple "Introduction" title.
**Updated Content**:  
Welcome to the 2025 Big Data Course! This course explores the foundations and modern practices of handling massive datasets using distributed systems, cloud technologies, and parallel programming. Topics include Big Data concepts, distributed file systems (HDFS), and frameworks like Apache Hadoop and Spark, with practical applications in real-world scenarios.

---

## Slide 2: Course Objectives
**Original Content**: Lists course topics (Purpose, Big Data, Distributed Systems, Parallel Programming).
**Updated Content**:  
This week, we introduce:  
- **Course Goals**: Understand Big Data challenges and solutions.  
- **Big Data Fundamentals**: Managing and analyzing massive datasets.  
- **Distributed Systems**: Scalable architectures for data storage and processing.  
- **Parallel Programming**: Leveraging frameworks like Hadoop and Spark for efficient computation.  

**Updates**: Added Spark to reflect its prominence in 2025 alongside Hadoop.

---

## Slide 3: Why Big Data Matters
**Original Content**: Placeholder "Binto" and "$=$".
**Updated Content**:  
Big Data drives innovation across industries. According to recent industry reports (e.g., LinkedIn 2024), top skills include:  
1. **Cloud and Distributed Computing** (e.g., AWS, Azure, Hadoop, Spark).  
2. **Data Analytics and Machine Learning** (e.g., Python, R, TensorFlow).  
3. **Data Storage and Management** (e.g., SQL, NoSQL, cloud-native databases).  
These skills are critical for roles in AI, IoT, and real-time analytics.

**Updates**: Replaced vague placeholder with industry-relevant context and updated skills to 2024/2025 standards.

---

## Slide 4: Big Data in Context
**Original Content**: Highlights LinkedIn skills and course relevance.
**Updated Content**:  
Big Data is a cornerstone of modern technology. LinkedIn’s 2024 skills report emphasizes:  
1. **Cloud and Distributed Computing**: Tools like Hadoop, Spark, and Kubernetes dominate.  
2. **Data Analytics**: Proficiency in Python, R, and ML frameworks is essential.  
3. **Storage Systems**: SQL, NoSQL (e.g., MongoDB, Cassandra), and cloud storage (e.g., S3, GCS).  
This course equips you with these in-demand skills for 2025.

**Updates**: Updated skillset to include modern tools and cloud platforms.

---

## Slide 5: Understanding Data Scales
**Original Content**: Table of prefixes (kilo, mega, etc.) with examples.
**Updated Content**:  
Big Data involves massive scales. Key prefixes:  

| Prefix | Factor | Example |
|--------|--------|---------|
| Kilo (k) | $10^3$ | A text page (~1 KB) |
| Mega (M) | $10^6$ | Network transfer speed (~1 MB/s) |
| Giga (G) | $10^9$ | Small datasets (~1 GB) |
| Tera (T) | $10^{12}$ | Enterprise databases |
| Peta (P) | $10^{15}$ | Large platform and cloud providers (e.g., Meta, AWS) |
| Exa (E) | $10^{18}$ | Global internet data |
| Zetta (Z) | $10^{21}$ | Future AI training datasets |

**Updates**: Corrected table formatting, added modern examples, and fixed exponent for Giga ($10^9$ instead of $10^7$).
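
As a rough illustration of the prefixes in the table, the sketch below converts a raw byte count into the largest fitting decimal unit. The function name `format_bytes` and the one-decimal formatting are choices made for this example, not part of the original slide.

```python
# SI prefixes from the table, smallest to largest (factor 1000 between each).
PREFIXES = ["", "k", "M", "G", "T", "P", "E", "Z"]

def format_bytes(n_bytes: float) -> str:
    """Express a byte count with the largest prefix that keeps the value below 1000."""
    value = float(n_bytes)
    index = 0
    while value >= 1000 and index < len(PREFIXES) - 1:
        value /= 1000
        index += 1
    return f"{value:.1f} {PREFIXES[index]}B"

print(format_bytes(4_200))        # 4.2 kB
print(format_bytes(20 * 10**15))  # 20.0 PB
```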

---

## Slide 6: What is Big Data?
**Original Content**: Defines Big Data with examples (Google, Facebook, Amazon).
**Updated Content**:  
Big Data refers to datasets too large for traditional systems to process efficiently. Illustrative order-of-magnitude estimates for 2025:  
- **Google**: ~50 EB of indexed web data.  
- **Meta**: ~10 EB of user-generated content, with 20 PB/day added.  
- **AWS**: Petabytes of cloud-hosted data for enterprises.  
- **Scientific Data**: Telescopes (~1 PB/day), CERN (~500 PB stored).  
Big Data requires distributed storage and processing due to its volume, velocity, and variety.

**Updates**: Updated data volumes to 2025 estimates and emphasized the "3 Vs" of Big Data.

---

## Slide 7: Distributed Systems for Big Data
**Original Content**: Discusses data distribution, HDFS, and MapReduce.
**Updated Content**:  
Processing Big Data requires:  
- **Distributed Storage**: Data split across thousands of machines in data centers (e.g., HDFS, cloud storage like S3).  
- **Specialized Databases**: NoSQL systems (e.g., Cassandra, MongoDB, Elasticsearch).  
- **Parallel Processing**: Frameworks like MapReduce (Hadoop) and Spark for scalable computation.  
Traditional databases can’t handle petabyte-scale data, necessitating distributed architectures.

**Updates**: Added cloud storage and Spark, clarified limitations of traditional databases.

---

## Slide 8: Inside a Data Center
**Original Content**: Describes a cluster of 5000 connected computers.
**Updated Content**:  
Modern data centers host thousands of servers forming clusters:  
- **Scale**: 10,000+ servers in hyperscale facilities (e.g., Google, AWS).  
- **Connectivity**: High-speed networks (100 Gbps+) for data sharing.  
- **Redundancy**: Data replication ensures fault tolerance.  
This course focuses on programming for such clusters using Hadoop and cloud-native tools.

**Updates**: Scaled up to reflect 2025 data center sizes and included cloud-native tools.

---

## Slide 9: Server Hardware
**Original Content**: Describes blade servers and Google’s approach.
**Updated Content**:  
A typical 2025 server (e.g., rack or blade):  
- **Specs**: 8-16 CPU cores, 2-4 TB RAM, 50-100 TB NVMe storage.  
- **Cost**: ~$10,000, with prices dropping due to competition.  
- **Trends**: Google and AWS use commodity hardware for cost-efficiency, paired with custom accelerators (e.g., TPUs, GPUs).  
These servers form the backbone of Big Data processing.

**Updates**: Updated hardware specs and costs to 2025 standards, added accelerators.

---

## Slide 10: Connected Machines
**Original Content**: Discusses cloud storage and distributed execution.
**Updated Content**:  
Clusters connect servers to share:  
- **Storage**: Distributed file systems (e.g., HDFS, cloud storage like S3) with data replication.  
- **Computation**: Programs run across multiple nodes using frameworks like Hadoop, Spark, or Kubernetes.  
This course teaches you to develop applications for such distributed environments using Hadoop and modern APIs.

**Updates**: Added Spark and Kubernetes, emphasized cloud integration.

---

## Slide 11: Introduction to Hadoop
**Original Content**: Repeated "Hadoop" text.
**Updated Content**:  
**Apache Hadoop** is an open-source framework for distributed storage and processing:  
- **HDFS**: Hadoop Distributed File System for scalable storage.  
- **YARN**: Resource manager for job scheduling.  
- **MapReduce**: Programming model for parallel data processing.  
- **Ecosystem**: Integrates with Spark, Hive, and cloud platforms (e.g., AWS EMR).  
Hadoop remains relevant in 2025 for on-premises and hybrid cloud deployments.

**Updates**: Replaced placeholder with a clear Hadoop overview, added ecosystem context.

---

## Slide 12: Hadoop File System (HDFS)
**Original Content**: Title only.
**Updated Content**:  
**HDFS Overview**:  
- **Purpose**: Stores massive datasets across multiple machines.  
- **Features**:  
  - Fault-tolerant through data replication.  
  - Scalable to petabytes.  
  - Transparent access via a unified file system view.  
- **Use Cases**: Data lakes, batch processing, and analytics.

**Updates**: Completed the slide with a concise HDFS introduction.

---

## Slide 13: HDFS Features
**Original Content**: Describes HDFS as a distributed file system.
**Updated Content**:  
**HDFS Characteristics**:  
- **Tree Structure**: Organizes files and directories like Unix.  
- **Transparency**: Hides physical storage locations from users.  
- **Replication**: Files are copied (default: 3 replicas) for reliability and parallel access.  
- **Scalability**: Handles petabytes across thousands of nodes, integrated with cloud storage in 2025.

**Updates**: Clarified features and added cloud integration.

---

## Slide 14: HDFS File Organization
**Original Content**: Compares HDFS to Unix, lists directories.
**Updated Content**:  
**HDFS File Structure**:  
- **Root (/) Structure**: Similar to Unix, with directories like `/hbase`, `/tmp`, `/var`.  
- **User Space**: `/user/<username>` for personal files, distinct from `/home`.  
- **System Directories**: Includes `/user/hive`, `/user/spark`, `/user/history`.  
- **Permissions**: Supports owners, groups, and access rights like ext4.

**Updates**: Streamlined description, corrected directory names, and clarified user space.

---

## Slide 15: HDFS Commands
**Original Content**: Lists HDFS commands with notes on Java latency.
**Updated Content**:  
**Common HDFS Commands**:  
```bash
hdfs dfs -help          # Display help
hdfs dfs -ls <path>     # List files
hdfs dfs -cat <file>    # View file content
hdfs dfs -mv <src> <dst> # Move/rename
hdfs dfs -cp <src> <dst> # Copy
hdfs dfs -mkdir <dir>   # Create directory
hdfs dfs -rm -r <dir>   # Remove directory
```
**Note**: Each `hdfs dfs` invocation starts a new JVM, so even simple commands take a noticeable moment to return.

**Updates**: Formatted commands clearly, removed outdated flag notes, and retained latency comment.

---

## Slide 16: HDFS File Operations
**Original Content**: Shows commands for file transfers.
**Updated Content**:  
**Transferring Files with HDFS**:  
- **Upload**:  
  ```bash
  hdfs dfs -put <local> <hdfs_path>  # or -copyFromLocal
  hdfs dfs -mkdir -p /user/books
  hdfs dfs -put dracula /user/books
  ```
- **Download**:  
  ```bash
  hdfs dfs -get <hdfs_path> <local>  # or -copyToLocal
  hdfs dfs -get /user/books/center_earth
  ```

**Updates**: Simplified and standardized command examples.

---

## Slide 17: How HDFS Works
**Original Content**: Explains HDFS block structure and replication.
**Updated Content**:  
**HDFS Mechanics**:  
- **Block Size**: Default 128 MB in Hadoop 2.x/3.x (configurable via `dfs.blocksize`, e.g., 256 MB for very large files).  
- **Replication**: Each block is replicated (default: 3) across different nodes for fault tolerance and parallel access.  
- **Distribution**: Blocks of a file are spread across nodes, not necessarily on the same machine.  
This ensures scalability and reliability in 2025’s hybrid cloud environments.
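
To make the block arithmetic concrete, here is a minimal sketch that computes how many blocks a file occupies and how much raw storage its replicas consume. It assumes a 128 MB block size and a replication factor of 3; both are configurable, and the function name `block_layout` is invented for this example.

```python
MB = 1024 * 1024  # block sizes are expressed in binary megabytes here

def block_layout(file_size_bytes: int, block_size_bytes: int = 128 * MB, replication: int = 3):
    """Return (number of blocks, size of the last block, raw bytes stored cluster-wide)."""
    if file_size_bytes == 0:
        return 0, 0, 0
    # Integer ceiling division: a partial final block still counts as one block.
    n_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    last_block = file_size_bytes - (n_blocks - 1) * block_size_bytes
    # The last block only occupies its real size, so raw usage is size x replicas.
    raw_bytes = file_size_bytes * replication
    return n_blocks, last_block, raw_bytes

blocks, last, raw = block_layout(1000 * MB)
print(blocks, last // MB, raw // MB)  # 8 104 3000
```

A 1000 MB file therefore spans eight blocks (seven full, one of 104 MB) and consumes about 3000 MB of disk across the cluster.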

**Updates**: Updated block size context and added cloud reference.

---

## Slide 18: HDFS Cluster Roles
**Original Content**: Describes NameNode, Secondary NameNode, and DataNodes.
**Updated Content**:  
**HDFS Architecture**:  
- **NameNode**: Manages metadata (file names, block locations).  
- **Secondary NameNode**: Periodically merges the NameNode's edit log into a checkpoint of the metadata image; it is not a hot standby.  
- **DataNodes**: Store actual data blocks.  
- **Clients**: Access points for interacting with the cluster.  
In 2025, high-availability setups often replace Secondary NameNode with standby NameNodes.

**Updates**: Clarified roles and noted high-availability trends.

---

## Slide 19: HDFS Node Schema
**Original Content**: Diagram of NameNode and DataNodes with file blocks.
**Updated Content**:  
**HDFS Data Distribution**:  
- **NameNode**: Stores metadata (e.g., `toto.txt` maps to blocks A, B, C).  
- **DataNodes**: Store blocks (e.g., DN1: A, B, C; DN2: A, C, D).  
- **Replication**: Each block is replicated across multiple DataNodes for reliability.  
Example: `toto.txt` blocks are distributed and replicated across DN1–DN4.
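
The mapping described above can be sketched with two dictionaries: one playing the NameNode's role (file to blocks) and one recording where each replica lives. The round-robin placement below is a deliberate simplification chosen for this example; real HDFS uses a rack-aware placement policy, and `place_blocks` is a hypothetical helper.

```python
def place_blocks(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return placement

# NameNode-style metadata: file name -> ordered list of block IDs.
namenode = {"toto.txt": ["A", "B", "C"]}
datanodes = ["DN1", "DN2", "DN3", "DN4"]

placement = place_blocks(namenode["toto.txt"], datanodes)
for block, nodes in placement.items():
    print(block, nodes)
# A ['DN1', 'DN2', 'DN3']
# B ['DN2', 'DN3', 'DN4']
# C ['DN3', 'DN4', 'DN1']
```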

**Updates**: Simplified schema explanation, removed redundant DataNode list.

---

## Slide 20: HDFS Reliability
**Original Content**: Explains replication and NameNode importance.
**Updated Content**:  
**HDFS Fault Tolerance**:  
- **Replication**: Blocks are copied (default: 3) across DataNodes for redundancy and parallel access.  
- **NameNode**: Central metadata store; a failure can disrupt HDFS.  
- **Mitigation**: Standby NameNodes in high-availability mode take over on failure; a Secondary NameNode only keeps metadata checkpoints that speed up recovery and does not provide failover.  
This is critical for 2025’s always-on data platforms.
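
One way to see what replication buys is to check, by brute force, how many DataNodes can fail before some block becomes unreadable. The layout below is a hypothetical one (three replicas per block over four DataNodes), and both helper functions are written for this illustration only.

```python
from itertools import combinations

def survives(placement, failed):
    """True if every block still has at least one replica on a live DataNode."""
    return all(any(node not in failed for node in nodes) for nodes in placement.values())

def max_tolerated_failures(placement, datanodes):
    """Largest k such that losing ANY k DataNodes leaves every block readable."""
    k = 0
    while k < len(datanodes) and all(
        survives(placement, set(combo)) for combo in combinations(datanodes, k + 1)
    ):
        k += 1
    return k

placement = {  # hypothetical layout: block ID -> DataNodes holding a replica
    "A": ["DN1", "DN2", "DN3"],
    "B": ["DN2", "DN3", "DN4"],
    "C": ["DN3", "DN4", "DN1"],
}
datanodes = ["DN1", "DN2", "DN3", "DN4"]
print(max_tolerated_failures(placement, datanodes))  # 2
```

With three replicas on distinct nodes, any two simultaneous DataNode failures leave every block readable; the NameNode's metadata is a separate concern that replication of blocks does not address.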

**Updates**: Streamlined explanation and emphasized high-availability.

---

## Slide 21: High Availability in HDFS
**Original Content**: Describes high-availability mode with standby NameNodes.
**Updated Content**:  
**HDFS High Availability (HA)**:  
- **Active and Standby NameNodes**: One active NameNode serves clients while one or more hot standbys track its state, ready to take over quickly.  
- **JournalNodes**: Synchronize metadata updates across NameNodes.  
- **Benefits**: Eliminates single point of failure, replaces Secondary NameNode.  
HA is standard in 2025 for production-grade Hadoop clusters.
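
The slide does not say how many JournalNodes are needed. In Hadoop's Quorum Journal Manager, an edit must be acknowledged by a majority of JournalNodes, which is why odd counts such as 3 or 5 are typical. The arithmetic can be sketched as follows; the function names are made up for this example.

```python
def journal_quorum(n_journalnodes: int) -> int:
    """Number of JournalNodes that must acknowledge an edit (strict majority)."""
    return n_journalnodes // 2 + 1

def tolerated_failures(n_journalnodes: int) -> int:
    """JournalNodes that can fail while a majority is still reachable."""
    return (n_journalnodes - 1) // 2

for n in (3, 4, 5):
    print(n, journal_quorum(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Going from three to four JournalNodes adds no extra failure tolerance, which is the usual argument for odd-sized groups.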

**Updates**: Clarified HA mechanics and its prevalence in 2025.

---

## Slide 22: Java API for HDFS
**Original Content**: Title only.
**Updated Content**:  
**HDFS Java API Overview**:  
- **Purpose**: Programmatically interact with HDFS for file operations.  
- **Key Classes**: `Configuration`, `FileSystem`, `Path`.  
- **Use Cases**: Reading, writing, and managing files in HDFS.  
Examples follow in subsequent slides, updated for Hadoop 3.x APIs in 2025.

**Updates**: Completed the slide with an API introduction.

---

## Slide 23: HDFS File Information
**Original Content**: Incomplete Java code for listing file blocks.
**Updated Content**:  
**Listing HDFS File Blocks (Java)**:  
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import java.io.IOException;

public class HDFSInfo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/apitest.txt");
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block.toString());
        }
        fs.close();
    }
}
```
**Purpose**: Retrieves block locations for a file in HDFS.

**Updates**: Completed and corrected the code, updated for Hadoop 3.x compatibility.

---

## Slide 24: Reading an HDFS File
**Original Content**: Incomplete Java code for reading a file.
**Updated Content**:  
**Reading an HDFS File (Java)**:  
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class HDFSRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/apitest.txt");
        FSDataInputStream in = fs.open(path);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line = reader.readLine();
        System.out.println(line);
        reader.close();
        fs.close();
    }
}
```
**Purpose**: Reads and prints the first line of a text file in HDFS.

**Updates**: Completed and corrected the code, ensured Hadoop 3.x compatibility.

---

## Slide 25: Writing an HDFS File
**Original Content**: Java code for writing a file.
**Updated Content**:  
**Writing an HDFS File (Java)**:  
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import java.io.IOException;

public class HDFSWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/apitest.txt");
        if (!fs.exists(path)) {
            FSDataOutputStream out = fs.create(path);
            out.write("Hello, Big Data World!\n".getBytes(java.nio.charset.StandardCharsets.UTF_8)); // plain UTF-8 text, no length prefix
            out.close();
        }
        fs.close();
    }
}
```
**Purpose**: Creates a file in HDFS if it does not already exist and writes a string to it.

**Updates**: Updated code for clarity and Hadoop 3.x compatibility.

---

## Slide 26: Compiling and Running HDFS Programs
**Original Content**: Compilation and execution commands with errors.
**Updated Content**:  
**Compiling and Running HDFS Programs**:  
1. **Compile**:  
   ```bash
   javac -cp "$(hadoop classpath)" HDFS*.java
   ```
2. **Package**:  
   ```bash
   jar cfe HDFSApp.jar HDFSWrite HDFS*.class
   ```
3. **Run**:  
   ```bash
   hadoop jar HDFSApp.jar   # main class (HDFSWrite) is read from the jar manifest
   ```
**Note**: Compile against the classpath of the installed Hadoop 3.x release, using a JDK that release supports (Hadoop 3.3+ runs on Java 8 and Java 11).

**Updates**: Corrected commands, updated for modern Hadoop and Java versions.

---

## Slide 27: Introduction to MapReduce
**Original Content**: Title "Algorithmes & MapReduce".
**Updated Content**:  
**MapReduce Overview**:  
- **Purpose**: Framework for parallel processing of large datasets.  
- **Components**:  
  - **Map**: Processes input data into key-value pairs.  
  - **Reduce**: Aggregates mapped data to produce final results.  
- **Relevance in 2025**: Complements Spark for batch processing in Hadoop ecosystems.

**Updates**: Added context and clarified MapReduce’s role alongside Spark.

---

## Slide 28: MapReduce Principles
**Original Content**: Describes MapReduce with sales examples.
**Updated Content**:  
**MapReduce Principles**:  
- **Goal**: Extract insights from large datasets.  
- **Examples**:  
  - Total sales of a product.  
  - Most expensive item.  
  - Average price.  
- **Process**:  
  - **Map**: Extracts relevant data per record.  
  - **Reduce**: Aggregates mapped data (e.g., sum, max, average).  
This approach scales to petabytes via distributed computing.
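
The three example questions can be sketched on a single machine with Python's `map` and `functools.reduce`. The sales records below are hypothetical. Note that an average cannot be reduced directly: the reduction carries a (sum, count) pair and divides only at the end.

```python
from functools import reduce

sales = [  # hypothetical records: (product, price)
    ("laptop", 900), ("phone", 600), ("laptop", 1100), ("tablet", 400),
]

# Map: keep only the field each question needs.
prices = list(map(lambda sale: sale[1], sales))
laptop_prices = [price for product, price in sales if product == "laptop"]

# Reduce: fold the mapped values into a single result.
total_laptop = reduce(lambda a, b: a + b, laptop_prices, 0)   # total sales of one product
highest = reduce(lambda a, b: a if a >= b else b, prices)     # most expensive item

# Average: reduce to (sum, count), then divide once.
s, n = reduce(lambda acc, p: (acc[0] + p, acc[1] + 1), prices, (0, 0))
average = s / n

print(total_laptop, highest, average)  # 2000 1100 750.0
```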

**Updates**: Simplified examples and emphasized scalability.

---

## Slide 29: MapReduce Example
**Original Content**: Table of car sales data.
**Updated Content**:  
**MapReduce Example: Car Sales**:  
Input data (car sales):  

| ID | Brand    | Model | Price  |
|----|----------|-------|--------|
| 1  | Renault  | Clio  | 4200   |
| 2  | Fiat     | 500   | 8840   |
| 3  | Peugeot  | 206   | 4300   |
| 4  | Peugeot  | 306   | 6140   |

**Task**: Calculate the maximum price.  
- **Map**: Extract price from each record.  
- **Reduce**: Find the maximum price (8840).

**Updates**: Clarified task and process, corrected table formatting.

---

## Slide 30: MapReduce Example (Continued)
**Original Content**: Explains Map and Reduce functions.
**Updated Content**:  
**MapReduce Workflow**:  
- **Map Function**: Extracts price from each car record.  
  ```python
  def getPrice(car):
      return car["price"]
  ```
- **Reduce Function**: Computes maximum price.  
  ```python
  def maxPrice(prices):
      return max(prices)
  ```
- **Output**: `max(map(getPrice, data))` yields 8840.  
This logic is parallelizable across clusters.

**Updates**: Provided clear Python pseudocode, emphasized parallelization.

---

## Slide 31: MapReduce in Python
**Original Content**: Python code for mapping prices.
**Updated Content**:  
**Python MapReduce Example**:  
```python
data = [
    {"id": 1, "brand": "Renault", "model": "Clio", "price": 4200},
    {"id": 2, "brand": "Fiat", "model": "500", "price": 8840},
    {"id": 3, "brand": "Peugeot", "model": "206", "price": 4300},
    {"id": 4, "brand": "Peugeot", "model": "306", "price": 6140}
]

# Map: Extract prices
prices = list(map(lambda car: car["price"], data))
print(prices)  # [4200, 8840, 4300, 6140]

# Reduce: Find maximum
max_price = max(prices)
print(max_price)  # 8840
```
**Purpose**: Demonstrates the map and reduce steps with plain Python 3 built-ins, on a single machine.

**Updates**: Updated to Python 3 syntax, used lambda for clarity.

---

## Slide 32: MapReduce Parallelization
**Original Content**: Explains map and reduce parallelization.
**Updated Content**:  
**Parallelizing MapReduce**:  
- **Map**: Fully parallelizable, as each record is processed independently.  
  ```python
  prices = [getPrice(car) for car in data]  # Can run on multiple nodes
  ```
- **Reduce**: Partially parallelizable in a hierarchical structure:  
  1. Compute intermediate results for value pairs.  
  2. Aggregate intermediates to final result.  
In 2025, frameworks like Spark optimize this further.
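
The hierarchical reduction described above can be sketched as a pairwise tree: each level combines neighbouring values, and the pairs within a level are independent of one another, so they could run on different nodes. This only works for associative operations such as `max` or addition. The helper `tree_reduce` is written for this example and runs sequentially.

```python
def tree_reduce(values, combine):
    """Reduce by combining neighbours pairwise, level by level."""
    level = list(values)
    if not level:
        raise ValueError("cannot reduce an empty sequence")
    while len(level) > 1:
        # Every pair in this comprehension is independent of the others.
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element out is carried to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

prices = [4200, 8840, 4300, 6140]
print(tree_reduce(prices, max))  # 8840
```

For n values the tree has about log2(n) levels, compared with n - 1 sequential steps for a left-to-right reduction.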

**Updates**: Simplified explanation, added Spark reference.

---

## Slide 33: YARN Overview
**Original Content**: Introduces YARN as a resource manager.
**Updated Content**:  
**YARN (Yet Another Resource Negotiator)**:  
- **Role**: Manages resources and schedules jobs in Hadoop clusters.  
- **Features**:  
  - Launches and monitors MapReduce or Spark jobs.  
  - Handles resource allocation and fault tolerance.  
  - Transparent to users, optimizing job execution.  
- **Relevance in 2025**: Remains the standard scheduler on Hadoop clusters; in cloud deployments, Kubernetes often fills the same role (e.g., for Spark jobs).

**Updates**: Added Spark and Kubernetes integration, clarified YARN’s role.

---

## Slide 34: MapReduce Framework
**Original Content**: Describes MapReduce as a Java environment.
**Updated Content**:  
**MapReduce Framework**:  
- **Purpose**: Java-based model for distributed data processing.  
- **Components**:  
  - **Mapper**: Processes input into key-value pairs.  
  - **Reducer**: Aggregates mapped pairs.  
- **Challenges**: Complex Java APIs; Spark is often preferred in 2025 for simplicity.  
- **Use Case**: Batch processing in Hadoop ecosystems.

**Updates**: Highlighted Spark’s prominence and simplified description.

---

## Slide 35: Key-Value Pairs in MapReduce
**Original Content**: Introduces key-value pairs.
**Updated Content**:  
**Key-Value Pairs**:  
- **Format**: Data is processed as (key, value) pairs (e.g., (line_number, text), (date, temperature)).  
- **Role**: Enables flexible data processing in MapReduce.  
- **Example**: A text file is split into (offset, line) pairs for mapping.  
This abstraction supports scalability across clusters.
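
A minimal sketch of how a text file becomes (offset, line) pairs, in the spirit of Hadoop's default text input handling, where the key is the byte offset of the line start. The generator `split_into_pairs` is a hypothetical stand-in, and the sample bytes reuse the car records used elsewhere in these slides.

```python
def split_into_pairs(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)   # advance by the full line, newline included

text = b"1,Renault,Clio,4200\n2,Fiat,500,8840\n"
for key, value in split_into_pairs(text):
    print(key, value)
# 0 1,Renault,Clio,4200
# 20 2,Fiat,500,8840
```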

**Updates**: Clarified concept with modern examples.

---

## Slide 36: Map Function Details
**Original Content**: Describes Map function behavior.
**Updated Content**:  
**Map Function**:  
- **Input**: A (key, value) pair (e.g., (offset, line)).  
- **Output**: Zero or more (key, value) pairs.  
- **Parallelization**: Each Map task runs independently on a data block.  
- **Example**: Extract phone call durations from (offset, call_record) pairs.  
YARN manages task distribution across nodes.
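
The "zero or more pairs" contract is easiest to see with a generator. The sketch below assumes call records laid out as `subscriber_id,called_number,date,duration` with the duration in seconds; the sample records and the unit are assumptions for illustration, and malformed lines simply produce no output.

```python
def map_call(offset, line):
    """Map one (offset, call_record) pair to zero or one (subscriber_id, duration) pair."""
    fields = line.split(",")
    if len(fields) != 4:
        return                      # malformed record: emit nothing
    subscriber_id, _called_number, _date, duration = fields
    try:
        seconds = int(duration)
    except ValueError:
        return                      # non-numeric duration: emit nothing
    yield subscriber_id, seconds

records = [  # hypothetical (offset, line) input pairs
    (0, "S1,555-0101,2025-01-03,120"),
    (27, "S2,555-0199,2025-01-03,45"),
    (53, "corrupted line"),
]
pairs = [pair for offset, line in records for pair in map_call(offset, line)]
print(pairs)  # [('S1', 120), ('S2', 45)]
```

The input key (the offset) is ignored here, which is common: it exists so that every record has a key, not because the Map logic needs it.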

**Updates**: Streamlined explanation, added example.

---

## Slide 37: Reduce Function Details
**Original Content**: Describes Reduce function behavior.
**Updated Content**:  
**Reduce Function**:  
- **Input**: A key and a list of values from Map tasks.  
- **Output**: Typically one (key, value) pair per key.  
- **Parallelization**: The shuffle phase groups pairs by key, so different keys can be reduced in parallel by different Reducers.  
- **Example**: Sum call durations for a subscriber ID.  
Critical for aggregating large datasets efficiently.
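
A minimal sketch of the Reduce side for the call-duration example: values are first grouped by key (a stand-in for what the framework does between Map and Reduce), then each key's list is reduced to one pair. The mapped pairs below are hypothetical, as are both function names.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values emitted for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_durations(subscriber_id, durations):
    """Reduce one key and its values to a single (key, total) pair."""
    return subscriber_id, sum(durations)

mapped = [("S1", 120), ("S2", 45), ("S1", 300), ("S2", 15)]  # hypothetical Map output
totals = [reduce_durations(k, v) for k, v in sorted(group_by_key(mapped).items())]
print(totals)  # [('S1', 420), ('S2', 60)]
```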

**Updates**: Clarified process and added example.

---

## Slide 38: MapReduce Job Phases
**Original Content**: Lists MapReduce phases (preprocessing, split, map, shuffle, reduce).
**Updated Content**:  
**MapReduce Job Stages**:  
1. **Preprocessing**: Decompresses input files.  
2. **Split**: Divides data into (key, value) pairs (e.g., (byte_offset, line) for text input).  
3. **Map**: Applies user-defined function to each pair.  
4. **Shuffle & Sort**: Groups pairs by key for Reduce tasks.  
5. **Reduce**: Aggregates values for each key.  
YARN orchestrates these stages across clusters.
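
The stages can be walked through end to end on one machine. The sketch below reuses the car-sales records from the earlier example and computes the maximum price per brand; preprocessing is skipped because the input is already plain text, and the sequential lists stand in for what a cluster would do in parallel.

```python
from itertools import groupby
from operator import itemgetter

raw = "1,Renault,Clio,4200\n2,Fiat,500,8840\n3,Peugeot,206,4300\n4,Peugeot,306,6140\n"

# Split: one (key, line) pair per record; the key (a line index here) is unused by this job.
split = list(enumerate(raw.splitlines()))

# Map: emit (brand, price) for each record.
def mapper(_key, line):
    _id, brand, _model, price = line.split(",")
    yield brand, int(price)

mapped = [pair for key, line in split for pair in mapper(key, line)]

# Shuffle & Sort: order by key so equal keys are adjacent, then group them.
mapped.sort(key=itemgetter(0))
grouped = [(brand, [price for _, price in group])
           for brand, group in groupby(mapped, key=itemgetter(0))]

# Reduce: one result per key.
def reducer(brand, prices):
    return brand, max(prices)

result = [reducer(brand, prices) for brand, prices in grouped]
print(result)  # [('Fiat', 8840), ('Peugeot', 6140), ('Renault', 4200)]
```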

**Updates**: Simplified and clarified stages.

---

## Slide 39: MapReduce Job Execution
**Original Content**: Describes YARN’s role in job execution.
**Updated Content**:  
**MapReduce Execution Flow**:  
1. **Data Location**: YARN queries NameNode for block locations.  
2. **Split**: Data is split into (key, value) pairs.  
3. **Map Tasks**: YARN launches Mappers on DataNodes.  
4. **Shuffle**: Sorts and redistributes pairs by key.  
5. **Reduce Tasks**: Aggregates results.  
In 2025, Spark often replaces MapReduce for faster execution.

**Updates**: Added Spark comparison, streamlined flow.

---

## Slide 40: Java API for MapReduce
**Original Content**: Title only.
**Updated Content**:  
**MapReduce Java API**:  
- **Purpose**: Develop distributed applications for Hadoop.  
- **Key Classes**: `Mapper`, `Reducer`, `Job`.  
- **Use Case**: Process large datasets with custom logic.  
Examples follow, updated for Hadoop 3.x and Java 11+.

**Updates**: Completed slide with API overview.

---

## Slide 41: Mapper Class
**Original Content**: Java code for Mapper subclass.
**Updated Content**:  
**Mapper Class Example**:  
```java
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class TraitementMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        Text outKey = new Text(line.split(",")[0]); // Example: Extract first column
        IntWritable outValue = new IntWritable(1);
        context.write(outKey, outValue);
    }
}
```
**Purpose**: Processes input lines into key-value pairs.

**Updates**: Updated for Hadoop 3.x, added example logic.

---

## Slide 42: Reducer Class
**Original Content**: Java code for Reducer subclass.
**Updated Content**:  
**Reducer Class Example**:  
```java
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class TraitementReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```
**Purpose**: Aggregates values for each key (e.g., summing counts).

**Updates**: Updated for Hadoop 3.x, clarified logic.

---

## Slide 43: Driver Class
**Original Content**: Java code for MapReduce driver.
**Updated Content**:  
**Driver Class Example**:  
```java
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TraitementDriver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TraitementDriver(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "Traitement");
        job.setJarByClass(TraitementDriver.class);
        job.setMapperClass(TraitementMapper.class);
        job.setReducerClass(TraitementReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
```
**Purpose**: Configures and runs a MapReduce job.

**Updates**: Updated for Hadoop 3.x, added output types.

---

## Slide 44: Compiling and Running MapReduce
**Original Content**: Compilation and execution commands.
**Updated Content**:  
**Running a MapReduce Job**:  
1. **Compile**:  
   ```bash
   javac -cp "$(hadoop classpath)" Traitement*.java
   ```
2. **Package**:  
   ```bash
   jar cfe Traitement.jar TraitementDriver Traitement*.class
   ```
3. **Prepare**:  
   ```bash
   hdfs dfs -rm -r -f /user/output   # -f: no error if the directory does not exist yet
   ```
4. **Run**:  
   ```bash
   yarn jar Traitement.jar /user/input /user/output   # main class is read from the jar manifest
   ```
**Note**: Use Hadoop 3.x with a JDK that your release supports (Hadoop 3.3+ runs on Java 8 and Java 11).

**Updates**: Corrected commands, updated for modern Hadoop.

---

## Slide 45: MapReduce in Practice
**Original Content**: Placeholder "Un schema".
**Updated Content**:  
**MapReduce Workflow Diagram**:  
- **Input**: Data split into blocks.  
- **Map**: Processes blocks into (key, value) pairs.  
- **Shuffle**: Groups pairs by key.  
- **Reduce**: Aggregates values per key.  
- **Output**: Final results stored in HDFS.  
This process is optimized by YARN for scalability.

**Updates**: Completed slide with a workflow description.

---

## Slide 46: MapReduce Example (Phone Calls)
**Original Content**: Phone call duration example.
**Updated Content**:  
**Example: Total Call Duration**:  
- **Input**: CSV file of calls (subscriber_id, called_number, date, duration).  
- **Map**: Emits (subscriber_id, duration) pairs.  
- **Reduce**: Sums durations per subscriber_id.  
- **Output**: Total call duration per subscriber.  
YARN distributes tasks across nodes for efficiency.

**Updates**: Clarified example and process.

---

## Slide 47: MapReduce Optimizations
**Original Content**: Notes on Map and Reduce instances.
**Updated Content**:  
**MapReduce Optimizations**:  
- **Map**: One Map task is launched per input split (typically one per HDFS block), so a node may run several Mappers; each task processes its records sequentially.  
- **Reduce**: Hierarchical reduction for large datasets, with multiple Reducers.  
- **2025 Trends**: Spark often replaces MapReduce for faster, in-memory processing.  
These optimizations ensure scalability for massive datasets.
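
Two mechanisms sit behind the "multiple Reducers" idea, sketched below under stated assumptions. A partitioner decides which Reducer receives each key; Hadoop's default hashes the key modulo the number of Reducers, and `zlib.crc32` is used here only as a deterministic stand-in. A combiner, which the slide does not mention by name, is Hadoop's optional Mapper-side pre-aggregation and acts as the first level of a hierarchical reduction.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Pick the Reducer for a key; equal keys always land on the same Reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def combine(pairs):
    """Combiner: pre-sum values per key on the Mapper side to shrink shuffle traffic."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

mapper_output = [("S1", 120), ("S2", 45), ("S1", 300)]  # hypothetical output of one Mapper
combined = combine(mapper_output)
print(combined)  # [('S1', 420), ('S2', 45)]

# Route each combined pair to one of two Reducers.
buckets = defaultdict(list)
for key, value in combined:
    buckets[partition(key, 2)].append((key, value))
```

A combiner is only safe when the reduction is associative and commutative (sums, maxima, counts); an average would need the (sum, count) form.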

**Updates**: Added Spark comparison, simplified explanation.

---

## Slide 48: Conclusion
**Original Content**: Placeholder "Dnnate".
**Updated Content**:  
**Course Summary**:  
- **Big Data**: Handles massive, complex datasets.  
- **HDFS**: Scalable, fault-tolerant storage.  
- **MapReduce**: Parallel processing framework.  
- **Next Steps**: Explore Spark, cloud-native tools, and real-world applications in upcoming weeks.  
This course prepares you for 2025’s data-driven world.

**Updates**: Completed slide with a course wrap-up.

---