🎥 Recommended Video: [Hadoop Explained in 4 minutes](https://www.youtube.com/watch?v=8pqvKo2Of50)

🎥 Recommended Video: [What exactly is Apache Spark? | Big Data Tools](https://www.youtube.com/watch?v=ymtq8yjmD9I)



## **Introduction**
Welcome to the world of **Big Data and Cloud Computing**! Over the next 3 weeks, we’ll explore how to handle massive amounts of data, process it efficiently, and use cloud platforms to scale your data science projects. Think of this as learning how to manage a library that’s growing faster than you can imagine, and using super-powered tools to organize and analyze it all.

---

## **Week 1: Introduction to Distributed Computing**

### **What is Distributed Computing?**
Imagine you have a giant puzzle to solve, but it’s too big for one person to handle. Distributed computing is like splitting the puzzle into smaller pieces and giving each piece to a different person to solve. Once everyone finishes, you combine the results to see the full picture.
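
The split, solve, and combine pattern can be sketched on a single machine. This is a minimal illustration, assuming worker threads stand in for separate computers and that summing numbers stands in for the real work; `split_into_pieces`, `solve_piece`, and `distributed_sum` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_pieces(items, num_pieces):
    # Deal the items out so each worker gets a roughly equal share
    return [items[i::num_pieces] for i in range(num_pieces)]

def solve_piece(piece):
    # Each worker solves only its own piece (here: a partial sum)
    return sum(piece)

def distributed_sum(items, num_workers=4):
    pieces = split_into_pieces(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(solve_piece, pieces))
    # Combine the partial results to get the full answer
    return sum(partial_results)

numbers = list(range(1, 101))
print(distributed_sum(numbers))
```

The threads here only illustrate the pattern. On a real cluster, each piece would be sent to a different machine.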

In the tech world, **Hadoop** and **Spark** are frameworks that act like team leaders: they split big data into pieces, hand those pieces to multiple computers, and coordinate the work.

---

### **Hadoop**
- **What it does**: Hadoop is like a librarian who stores and organizes books (data) in a big library (cluster of computers).
- **Key Components**:
  - **HDFS (Hadoop Distributed File System)**: Stores data across multiple machines.
  - **MapReduce**: Processes data in parallel across the cluster.
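
To make the HDFS idea concrete, here is a toy sketch of splitting a file into fixed-size blocks and placing copies on several machines. It is not how HDFS is implemented: the round-robin placement, the function names, and the node names are assumptions for illustration. Real HDFS defaults to 128 MB blocks and three replicas; the tiny sizes below just keep the output readable.

```python
def split_into_blocks(data, block_size):
    # Cut the data into consecutive chunks of at most block_size bytes
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replication=3):
    # Hypothetical round-robin placement: block b is copied to `replication`
    # consecutive nodes, starting at node b and wrapping around the list
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 10  # a 10-byte "file"
blocks = split_into_blocks(data, block_size=4)
nodes = ["node1", "node2", "node3", "node4"]
for block_id, holders in place_blocks(len(blocks), nodes).items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Because every block lives on more than one machine, losing a single node does not lose any data.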

#### **Example: Counting Words in a Book**
Let’s say you want to count how many times the word "data" appears in a giant book. Hadoop splits the book into chapters, gives each chapter to a different computer, and then combines the results.

```python
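# Simplified MapReduce word count in plain Python (no Hadoop required).
# Short strings stand in for chapter files so the example runs as-is.
def map_chapter(chapter_text):
    # Map step: emit a ("data", 1) pair for every occurrence of the word
    words = chapter_text.lower().split()
    return [(word, 1) for word in words if word == "data"]

def reduce_counts(pairs):
    # Reduce step: add up the counts from all emitted pairs
    return sum(count for _, count in pairs)

# MapReduce in action: on a cluster, each chapter would go to a different machine
chapters = ["big data is everywhere", "data about data", "no match here"]
mapped = [pair for chapter in chapters for pair in map_chapter(chapter)]
total = reduce_counts(mapped)
print(f"The word 'data' appears {total} times.")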
```
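
The example above counts a single word, so it can skip a step. When counting every word, MapReduce groups the mapped pairs by key between the two phases, a step usually called the shuffle. Here is a minimal single-machine sketch, assuming short strings stand in for chapters; all function names are hypothetical.

```python
from collections import defaultdict

def map_words(chapter_text):
    # Map: emit (word, 1) for every word in this chapter
    return [(word, 1) for word in chapter_text.lower().split()]

def shuffle(pairs):
    # Shuffle: group all counts that share the same word
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_word(counts):
    # Reduce: add up the counts for one word
    return sum(counts)

def word_count(chapters):
    mapped = [pair for chapter in chapters for pair in map_words(chapter)]
    grouped = shuffle(mapped)
    return {word: reduce_word(counts) for word, counts in grouped.items()}

print(word_count(["big data", "data about data"]))
```

On a cluster, the shuffle is what moves all pairs for the same word to the same machine, so each reducer can work independently.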

---

### **Spark**
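- **What it does**: Spark is like a faster librarian. It keeps intermediate data in memory (RAM) where possible instead of writing it to disk between steps, which often makes multi-step jobs much quicker than MapReduce.
- **Why use Spark?**: It’s well suited to iterative workloads such as machine learning, interactive queries, and near-real-time stream processing.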
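
The benefit of keeping data in memory shows up when the same dataset is used more than once. The sketch below is a plain-Python analogy, not Spark code: `SlowStore` is a hypothetical class whose `load` method stands in for an expensive disk read, and its counter shows how many reads each approach triggers. In Spark itself, the comparable step is calling `cache()` on a DataFrame or RDD.

```python
class SlowStore:
    """Stand-in for disk storage; counts how often the data is read."""

    def __init__(self, records):
        self._records = records
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)

def run_without_cache(store, passes):
    # Re-read from the store on every pass, as a disk-bound job would
    return [sum(store.load()) for _ in range(passes)]

def run_with_cache(store, passes):
    # Read once, then reuse the in-memory copy on every pass
    in_memory = store.load()
    return [sum(in_memory) for _ in range(passes)]

disk_store = SlowStore([1, 2, 3])
run_without_cache(disk_store, passes=5)
print("reads without cache:", disk_store.reads)

cached_store = SlowStore([1, 2, 3])
run_with_cache(cached_store, passes=5)
print("reads with cache:", cached_store.reads)
```

Algorithms that loop over the same data many times, which is common in machine learning, are where this difference matters most.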

#### **Example: Analyzing Social Media Posts**
Imagine you want to analyze millions of tweets to find trending topics. Spark can spread this work across a cluster, so a job that would crawl on one machine can finish far sooner.

```python
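# PySpark example (assumes tweets.json has one JSON object per line with a "text" field)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session
spark = SparkSession.builder.appName("TrendingTopics").getOrCreate()

# Load data
tweets = spark.read.json("tweets.json")

# Split each tweet into words, then keep only the words that start with "#"
words = tweets.select(F.explode(F.split(F.col("text"), r"\s+")).alias("hashtag"))
hashtags = words.filter(F.col("hashtag").startswith("#")).groupBy("hashtag").count()

# Show the most frequent hashtags first
hashtags.orderBy(F.desc("count")).show()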
```
