# Virtual Machines Fundamentals

**What is a Virtual Machine?**   

A **Virtual Machine (VM)** is a **software-based computer** that simulates a physical computer system. Think of it as a computer within a computer.

**Key Characteristics:**    
- **Software-based**: Runs as software on physical hardware
- **Multiple VMs per server**: One physical server can host many VMs
- **Resource allocation**: Each VM is allocated its own share of the host's CPU, memory, and storage
- **Strong isolation**: VMs are isolated from each other for security

**Each VM Has:**     
1. **Its own CPU allocation** - Virtual CPUs (vCPUs) assigned by the hypervisor
2. **Its own memory** - Separate RAM allocation
3. **Its own operating system** - Can run a different OS than the host
4. **Strong isolation** - Failures in one VM don't affect others

**Benefits of Virtualization**:     
- **Cost efficiency**: Better hardware utilization
- **Flexibility**: Easy to create, modify, or delete
- **Isolation**: Security and stability
- **Scalability**: Quick provisioning of new resources



# Amazon EC2 Overview

**Amazon Elastic Compute Cloud (EC2)** is AWS's service that provides virtual machines in the cloud.

## Core Concept
- **Each EC2 instance = One Virtual Machine**
- Running on AWS's physical infrastructure
- AWS manages the hardware and virtualization layer; you configure and manage the instance itself

## What Users Choose When Creating an EC2 Instance

1. **Instance Type**
   - Determines CPU and memory specifications
   - Examples: t3.micro, m5.large, c5.xlarge

2. **Operating System (AMI)**
   - Linux distributions (Amazon Linux, Ubuntu, Red Hat)
   - Windows Server
   - Custom images

3. **Storage**
   - **EBS** (Elastic Block Store) volumes
   - Instance store (temporary)
   - Size and type (SSD, HDD)

4. **Network Configuration**
   - VPC (Virtual Private Cloud)
   - Subnet placement
   - Security groups
   - Public/Private IP



## EC2 Instance 

An EC2 Instance is **a running Virtual Machine** in AWS, created from an **AMI** (Amazon Machine Image).   

An instance includes:   
- **CPU**: Processing power from the instance type
- **Memory**: RAM allocation
- **Disk**: Storage volumes (EBS or instance store)
- **Network**: Network interfaces and IP addresses

### Instance Types

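AWS offers different instance types optimized for **various workloads**. The naming convention follows a pattern: **Family + Generation (+ optional attribute letters) + Size**. For example, in `m5.2xlarge` the family is `m`, the generation is `5`, and the size is `2xlarge`; in `t4g.micro` the `g` marks an AWS Graviton processor.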

<img src='./pic/5_instance-types.png' width=400>
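
As a rough illustration of the naming pattern, the sketch below splits an instance type name into its parts. The function name `parse_instance_type` is hypothetical, and the parsing is a simplification: it assumes a lowercase family prefix, a generation number, optional attribute letters, and a size after the dot, so unusual type names may not fit.

```bash
# Hypothetical helper: split an instance type name into its parts.
# The body runs in a subshell "( ... )" so its variables do not leak.
parse_instance_type() (
  prefix=${1%%.*}   # text before the dot, e.g. "m6i"
  size=${1#*.}      # text after the dot, e.g. "large"
  family=$(printf '%s\n' "$prefix" | sed 's/[0-9].*$//')
  generation=$(printf '%s\n' "$prefix" | sed 's/^[a-z]*\([0-9]*\).*$/\1/')
  attributes=$(printf '%s\n' "$prefix" | sed 's/^[a-z]*[0-9]*//')
  printf 'family=%s generation=%s attributes=%s size=%s\n' \
    "$family" "$generation" "${attributes:-none}" "$size"
)

parse_instance_type m5.2xlarge   # family=m generation=5 attributes=none size=2xlarge
parse_instance_type t4g.micro    # family=t generation=4 attributes=g size=micro
```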


#### 1. General Purpose

**Characteristics**:
- Balanced CPU and memory ratio
- Good for general workloads

**Examples**:
- `t3, t3a, t4g` - Burstable performance, cost-effective
- `m5, m6i` - Steady-state workloads

**When to Use**:
- Balanced applications
- Web servers
- Code repositories
- Development/test environments



#### 2. Compute Optimized

**Characteristics**:
- Higher ratio of vCPUs to memory than general-purpose types
- Optimized for compute-intensive tasks

**Examples**:
- `c5, c6i, c6g` - Compute-optimized families (`c6g` uses AWS Graviton processors)

**When to Use**:
- High-performance computing
- Batch processing
- Gaming servers
- Scientific modeling
- Ad serving engines


#### 3. Memory Optimized

**Characteristics**:
- Large memory capacity per node
- Fast memory access

**Examples**:
- `r5, r6g` - Standard memory-optimized
- `x1, x2` - Extreme memory capacity

**When to Use**:
- In-memory databases (Redis, Memcached)
- Real-time big data processing
- High-performance databases
- SAP HANA



#### 4. Storage Optimized

**Characteristics**:
- High disk I/O throughput
- Large local storage capacity

**Examples**:
- `i3` - NVMe SSD storage
- `d2` - Dense HDD storage

**When to Use**:
- NoSQL databases (Cassandra, MongoDB)
- High I/O databases
- Data warehousing
- Distributed file systems (HDFS)
- Log processing



#### 5. Accelerated Computing

**Characteristics**:
- GPU or specialized hardware
- Parallel processing capabilities

**Examples**:
- `p3, p4` - GPU instances for ML training
- `g5` - Graphics-intensive applications

**When to Use**:
- Machine learning/ Deep learning training and inference
- High-performance computing
- Graphics processing (like rendering)
- Video transcoding



## AMI (Amazon Machine Image)

An **AMI** is a **template** used to create EC2 instances. Think of it as a snapshot or blueprint of a configured system.   



### AMI Contains:

1. **Operating System**
   - Linux (Amazon Linux, Ubuntu, CentOS, Red Hat)
   - Windows Server

2. **Pre-installed Software**
   - Applications and tools
   - Dependencies and libraries
   - Configuration files

3. **System Configurations**
   - User accounts
   - Security settings
   - Network configurations


<img src='./pic/5_ami-lifecycle-creation-process.jpg' width=800>



### Types of AMIs:

1. **AWS-provided AMIs**
   - Amazon Linux 2023
   - Ubuntu Server
   - Windows Server

2. **AWS Marketplace AMIs**
   - Pre-configured with **commercial software**
   - Vendor-maintained

3. **Custom AMIs**
   - Created from existing EC2 instances
   - Your own configurations



## EC2 Key Pairs and Security

A **Key Pair** is used for **secure SSH (Secure Shell) access** to Linux EC2 instances (for Windows instances it is used to decrypt the initial Administrator password). It's a **cryptographic authentication** method that's more secure than passwords.


### Key Pair Components:

A key pair consists of a public key and a private key.

#### Public Key
- Stored on the EC2 instance
- Installed during instance launch
- Located in the default user's `~/.ssh/authorized_keys`

#### Private Key
- Downloaded and kept by you at creation time
- **Cannot be downloaded again**: AWS does not keep a copy, so store it safely
- File extensions: `.pem` or `.ppk`

**Private Key Formats:**

1. `.pem` Format (Privacy Enhanced Mail)
   - **For**: OpenSSH
   - **Used by**: 
     - Mac (Terminal)
     - Linux (SSH command)
     - Modern Windows (PowerShell, Windows Terminal)
   - **Standard format** for AWS

2. `.ppk` Format (PuTTY Private Key)
   - **For**: PuTTY SSH client
   - **Used by**: Older Windows systems
   - Can convert from `.pem` using PuTTYgen



### Key Types:

#### 1. RSA (Rivest-Shamir-Adleman)
- **Most common and widely supported**
- Works with **all SSH clients**
- Standard key size: 2048 or 4096 bits
- Compatible with **older systems**

#### 2. ED25519 (EdDSA)
- **Newer, more efficient**
- Shorter keys with comparable security
- Faster operations
- Requires **modern SSH clients**; not supported for Windows instances


### Important Security Facts:

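- **EC2 Linux instances do not use password login by default**
- You log in as the AMI's default user (for example `ec2-user`) with the key pair, then use `sudo` for root privileges
- A lost private key cannot be recovered from AWS. Access can sometimes be restored another way (for example through Session Manager, if it is configured); otherwise the instance must be replaced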



### Key Pair Advantages:

1. **Strong Security**
   - RSA keys of 2048 bits or more
   - Infeasible to brute-force with current technology

2. **No Password Brute Force**
   - Cannot guess or crack like passwords
   - Immune to dictionary attacks

3. **Automated Access for servers**
   - Perfect for scripts and automation
   - No interactive password prompts




### Using Key Pairs:

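```bash
# Restrict the key file to owner read-only first;
# ssh refuses private keys that other users can read
chmod 400 /path/to/private-key.pem

# Connect using SSH with the key pair
# (ec2-user is the default user on Amazon Linux; Ubuntu AMIs use "ubuntu")
ssh -i /path/to/private-key.pem ec2-user@<instance-public-ip>
```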
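
The permission requirement can also be checked from a script before connecting. The following is a minimal sketch; `key_is_private` is a hypothetical helper name, and it relies on the usual `ls -l` mode string layout rather than on any SSH tooling.

```bash
# Hypothetical helper: succeed only if group and others have no access to the file.
key_is_private() {
  # Characters 5-10 of the `ls -l` mode string are the group and other bits.
  [ "$(ls -ld "$1" | cut -c5-10)" = "------" ]
}

# Demonstration on a throwaway file rather than a real key
demo_key=$(mktemp)
chmod 644 "$demo_key"
key_is_private "$demo_key" || echo "too open: ssh would reject this key"
chmod 400 "$demo_key"
key_is_private "$demo_key" && echo "ok: owner read-only"
rm -f "$demo_key"
```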



## EC2 Cost Management

### Billing Model

**EC2 instances are billed per second** (minimum 60 seconds) or per hour depending on the OS and instance type.
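
To make the 60-second minimum concrete, here is a small sketch of the arithmetic. The helper name `run_cost` and the hourly rate of 0.36 USD are invented for illustration; real rates depend on instance type, region, and OS.

```bash
# Hypothetical helper: cost of one run under per-second billing.
# Usage: run_cost <hourly_rate_usd> <seconds_running>
run_cost() {
  awk -v rate="$1" -v secs="$2" 'BEGIN {
    if (secs < 60) secs = 60          # 60-second minimum charge
    printf "%.4f\n", rate * secs / 3600
  }'
}

run_cost 0.36 45     # billed as 60 seconds
run_cost 0.36 3600   # one full hour
```

A 45-second run is charged like a 60-second run, while anything longer is charged for the seconds actually used.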

### Cost Factors:

1. **Instance Type**
   - Larger instances = higher cost
   - Example: m5.2xlarge costs more than t3.micro

2. **Running Time**
   - Charged only when instance is running
   - Stopped instances don't incur compute charges

3. **Region**
   - Prices vary by AWS region
   - US regions are often cheaper than Asia/Pacific regions, but check current pricing

4. **Operating System**
   - Linux: Lower cost
   - Windows: Additional licensing fees
   - Commercial software: Extra charges

### Cost States:

#### Stopped Instance:
- ✅ **Compute cost stops** (CPU, memory charges end)
- ⚠️ **Storage may still cost money** (EBS volumes)
- Instance is preserved and can be restarted

#### Terminated Instance:
- ✅ **Compute cost stops** immediately
- ⚠️ **Attached resources may still be billed**:
  - Elastic IPs (until they are released)
  - EBS volumes (if not deleted with instance)
  - Snapshots

### Cost Optimization Tips:

1. **Stop instances when not in use** (non-production)
2. **Terminate unused instances** completely
3. **Right-size your instances** (don't over-provision)
4. **Use Reserved Instances** for predictable workloads (up to 72% savings compared with On-Demand)
5. **Consider Spot Instances** for fault-tolerant workloads (up to 90% savings)
6. **Delete unattached EBS volumes**
7. **Release unused Elastic IPs**
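
The effect of the first tip can be estimated with simple arithmetic. This is a sketch under stated assumptions: the 0.10 USD hourly rate, the 730-hour month, and the 220 office hours are illustrative numbers, and `monthly_cost` is a hypothetical helper. EBS storage is billed separately and keeps accruing while the instance is stopped.

```bash
# Hypothetical helper: monthly compute cost for a given number of running hours.
# Usage: monthly_cost <hourly_rate_usd> <hours_running>
monthly_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

rate=0.10                                   # assumed hourly rate in USD
always_on=$(monthly_cost "$rate" 730)       # roughly 730 hours in a month
office_hours=$(monthly_cost "$rate" 220)    # 10 hours x 22 working days
echo "always on:    \$$always_on"
echo "office hours: \$$office_hours"
```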



## Connecting to EC2 Instances

AWS provides three main methods to connect to EC2 instances.

### 1. SSH Client (Traditional Method)

Secure Shell protocol connection using a key pair.

**Requirements**:
- Private key file (`.pem`)
- Port 22 open in security group
- SSH client installed (built-in on Mac/Linux)

**How to connect**:
```bash
ssh -i /path/to/key.pem ec2-user@<public-ip>
```

**Best for**:
- Command-line users
- Automation and scripts
- Developers and administrators

**Advantages**:
- Full control and flexibility
- Works from any location
- Can use SSH tunneling and port forwarding

**Disadvantages**:
- Requires key management
- Must configure security groups
- Needs a network path to the instance (a public IP, or a VPN/bastion host for private instances)


### 2. EC2 Instance Connect (AWS Console)

**What it is**: Temporary SSH access through AWS Console

**Requirements**:
- Web browser
- AWS Console access
- Port 22 open in security group

**How it works**:
- AWS temporarily uploads a public key to the instance
- Valid for 60 seconds
- No need to manage your own keys

**Best for**:
- Quick troubleshooting
- Users without SSH keys
- Temporary access

**Advantages**:
- No key management needed
- Browser-based
- Automatic credential rotation

**Disadvantages**:
- Requires internet access
- Still needs port 22 open
- Mostly used through the AWS Console



### 3. Session Manager (SSM - AWS Systems Manager)

Secure shell access through AWS Systems Manager, from the browser or the CLI, with no SSH key required.

<img src='./pic/5_session_manager.png' width=500>

**Requirements**:
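- **SSM Agent installed** on instance (pre-installed on Amazon Linux 2023)
- **IAM role** attached to instance
- **Outbound HTTPS connectivity** from the instance to the Systems Manager endpoints (through the internet, a NAT gateway, or VPC endpoints)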

**Best for**:
- Maximum security
- Instances in private subnets
- Compliance requirements

**Advantages**:
- ✅ **No SSH key required**
- ✅ **No open inbound ports needed** (most secure)
- ✅ Works with private instances
- ✅ Session logging for auditing
- ✅ Browser-based or CLI

**Disadvantages**:
- Requires IAM configuration
- SSM Agent must be running
- Slight learning curve

**[How to use](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/connect-to-an-amazon-ec2-instance-by-using-session-manager.html)**:
1. Attach IAM role with `AmazonSSMManagedInstanceCore` policy
2. Connect via AWS Console → Systems Manager → Session Manager
3. Or use AWS CLI: `aws ssm start-session --target <instance-id>`
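
The CLI route can be wrapped in a small check so that a session is only attempted when the instance is actually registered with Systems Manager. This is a sketch, not an official workflow: `ssm_connect` is a hypothetical function name, and it assumes the AWS CLI and the Session Manager plugin are installed and credentials are configured.

```bash
# Hypothetical helper: open a Session Manager shell only if the instance
# is registered with Systems Manager and reports PingStatus "Online".
# Usage: ssm_connect <instance-id>
ssm_connect() {
  status=$(aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$1" \
    --query 'InstanceInformationList[0].PingStatus' \
    --output text)
  if [ "$status" = "Online" ]; then
    aws ssm start-session --target "$1"
  else
    echo "instance $1 is not reachable through SSM (status: $status)" >&2
    return 1
  fi
}

# Example call (placeholder ID): ssm_connect i-xxxxxxxxx
```

If the status is not `Online`, the usual suspects are the requirements listed above: the agent, the IAM role, or the network path.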



### Comparison Table

| Feature | SSH Client | EC2 Instance Connect | Session Manager |
|---------|-----------|---------------------|-----------------|
| Requires key pair | Yes | No | No |
| Port 22 open | Yes | Yes | No |
| Works in private subnet | Only via bastion or VPN | Only via an EC2 Instance Connect Endpoint | Yes |
| Browser-based | No | Yes | Yes |
| Session logging | Manual | No | Yes |
| Security posture | Good | Good | Excellent |
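
The table can be read as a simple decision rule. The sketch below encodes one possible reading of it; `choose_connection` is a hypothetical helper, and the preference order (Session Manager first, then SSH, then EC2 Instance Connect) is an assumption rather than an AWS recommendation.

```bash
# Hypothetical decision helper based on the comparison table.
# Arguments are "yes" or "no": <ssm_ready> <port22_open> <have_key_pair>
choose_connection() {
  if [ "$1" = yes ]; then
    echo "Session Manager"
  elif [ "$2" = yes ] && [ "$3" = yes ]; then
    echo "SSH client"
  elif [ "$2" = yes ]; then
    echo "EC2 Instance Connect"
  else
    echo "no method available: open port 22 or set up SSM"
  fi
}

choose_connection no yes yes   # SSH client
choose_connection no yes no    # EC2 Instance Connect
```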



# EC2 Instance Lifecycle

An EC2 instance moves through different states during its lifetime. Understanding these states is crucial for cost management and operations.

### Instance States:

#### 1. Pending
- Instance is being launched
- Transitioning to running state
- Not yet accessible

#### 2. Running
- ✅ Instance is powered on and operational
- 💰 **You are charged for compute** (CPU, memory)
- Can connect and use the instance
- Applications are running

#### 3. Stopping
- Transition state from running to stopped
- Instance is shutting down gracefully

#### 4. Stopped
- ⛔ Instance is powered off
- ✅ **No compute charge** (CPU, memory)
- 💰 **EBS volumes are still charged**
- Instance data is preserved
- Can be started again later

#### 5. Rebooting
- OS-level restart
- Instance remains on the same physical host
- Similar to restarting your computer
- Doesn't change instance metadata

#### 6. Shutting Down
- Transition state before termination
- Final shutdown process

#### 7. Terminated
- Instance is permanently deleted
- Cannot be restarted
- All instance data is lost (unless EBS volume set to persist)

### State Transition Diagram:

```
    Launch
       ↓
   [Pending]
       ↓
   [Running] ←→ [Rebooting]
       ↓ stop
   [Stopping]
       ↓
   [Stopped]
       ↓ start
   [Pending]
       ↓
   [Running]
       ↓ terminate
[Shutting Down]
       ↓
  [Terminated]
```

### Important Lifecycle Considerations:

#### Public IP Behavior:
- **Ephemeral by default**: Changes after stop/start
- Lost on stop, new IP assigned on start
- To keep same IP: Use **Elastic IP** (static IP address)

#### Storage Behavior:
- **EBS-backed root volume** (most common):
  - Data persists when stopped
  - Charged for storage even when stopped
  - Can be configured to delete on termination

- **Instance store** (temporary):
  - Data lost when stopped or terminated
  - Survives reboots only

### Actions and Their Effects:

| Action | Compute Cost | Storage Cost | Data Preserved | Public IP |
|--------|-------------|--------------|----------------|-----------|
| Stop | Stops | Continues (EBS) | Yes | Lost* |
| Reboot | Continues | Continues | Yes | Kept |
| Terminate | Stops | Stops** | No*** | Lost |

\* Unless using Elastic IP  
\** Unless EBS volume set to persist  
\*** Root volume deleted unless configured otherwise
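
For scripts that report on cost, the billing rules above can be expressed as a lookup on the state name. This is a minimal sketch: `billing_for_state` is a hypothetical helper, the state names are the ones the AWS CLI reports (for example `shutting-down`), and the transitional states are deliberately not modeled in detail.

```bash
# Hypothetical helper: what is billed in each instance state, per the notes above.
# Usage: billing_for_state <state-name>
billing_for_state() {
  case "$1" in
    running)    echo "compute + storage" ;;
    stopped)    echo "storage only (EBS)" ;;
    terminated) echo "nothing, unless volumes, snapshots or Elastic IPs were kept" ;;
    pending|stopping|shutting-down) echo "transitional state" ;;
    *)          echo "unknown state: $1" >&2; return 1 ;;
  esac
}

billing_for_state running   # compute + storage
billing_for_state stopped   # storage only (EBS)
```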

### Best Practices:

1. **Stop instances** during non-business hours (dev/test)
2. **Terminate unused instances** permanently
3. **Use Elastic IPs** for instances that need consistent addressing
4. **Tag instances** so that a scheduler (for example a scheduled script or Lambda function) can shut them down automatically
5. **Enable termination protection** for production instances
6. **Configure EBS volumes** to persist if needed after termination
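
Practices 1 and 4 can be combined in a small job that stops every running instance carrying a particular tag. The sketch below is one way to do it; the function name `stop_tagged_instances` and the tag `AutoStop=true` are assumptions made for this example, not AWS conventions. It could be run from cron or any other scheduler outside business hours.

```bash
# Hypothetical nightly job: stop running instances tagged AutoStop=true.
# The tag name "AutoStop" is an assumption, not an AWS convention.
stop_tagged_instances() {
  ids=$(aws ec2 describe-instances \
    --filters "Name=tag:AutoStop,Values=true" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)
  if [ -z "$ids" ]; then
    echo "nothing to stop"
    return 0
  fi
  # Word splitting is intended here: one argument per instance ID.
  # shellcheck disable=SC2086
  aws ec2 stop-instances --instance-ids $ids
}

# Example call: stop_tagged_instances
```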



# Shell Basics for EC2

### What is a Shell?

A **shell** is a command-line interface (CLI) that allows you to control the operating system through text commands.

### Purpose:
- Execute commands on the EC2 machine
- Automate tasks with scripts
- Manage files, processes, and system configurations
- Install and configure software

### Common Shells:

1. **bash (Bourne Again Shell)**
   - Most popular Linux shell
   - Default on most Linux distributions
   - Rich features and scripting capabilities

2. **sh (Bourne Shell)**
   - Original Unix shell
   - More basic, POSIX-compliant
   - Available on all Unix-like systems

3. **zsh (Z Shell)**
   - Modern shell with advanced features
   - Default on macOS

### How to Use Shell on EC2:

1. **Connect** via SSH, Instance Connect, or Session Manager
2. **Run commands** at the prompt:

```bash
# File and directory operations
ls              # List files
cd              # Change directory
pwd             # Print working directory
mkdir           # Make directory
rm              # Remove files

# Package management (Amazon Linux / Red Hat; newer releases use dnf, with yum kept as a compatible command)
sudo yum update
sudo yum install package-name

# Package management (Ubuntu / Debian)
sudo apt update
sudo apt install package-name

# Big data related
java -version           # Check Java version
spark-submit app.py     # Submit Spark job
hdfs dfs -ls /          # List HDFS files
```

### Basic Shell Commands Reference:

#### Navigation:
```bash
pwd                    # Show current directory
ls                     # List files
ls -la                 # List all files with details
cd /path/to/directory  # Change directory
cd ..                  # Go up one directory
cd ~                   # Go to home directory
```

#### File Operations:
```bash
mkdir directory_name   # Create directory
touch filename         # Create empty file
cp source dest         # Copy file
mv source dest         # Move/rename file
rm filename            # Delete file
rm -rf directory       # Delete directory recursively (no confirmation, cannot be undone)
```

#### Viewing Files:
```bash
cat filename           # Display file contents
less filename          # View file page by page
head filename          # Show first 10 lines
tail filename          # Show last 10 lines
tail -f filename       # Follow file updates (logs)
```

#### System Information:
```bash
whoami                 # Current user
hostname               # Machine name
df -h                  # Disk space
free -h                # Memory usage
top                    # Process monitor
```



# EC2 Management via CLI

You can manage EC2 instances using AWS CLI (Command Line Interface) or AWS Console. CLI is preferred for automation.

### Key EC2 Management Tasks:

### 1. Create or Use an Existing Key Pair

```bash
# Create a new key pair
aws ec2 create-key-pair \
  --key-name my-key-pair \
  --query 'KeyMaterial' \
  --output text > my-key-pair.pem

# Set correct permissions
chmod 400 my-key-pair.pem

# List existing key pairs
aws ec2 describe-key-pairs
```

### 2. Launch EC2 Instances

```bash
# Launch an instance
aws ec2 run-instances \
  --image-id ami-xxxxxxxxx \
  --instance-type t3.micro \
  --key-name my-key-pair \
  --security-group-ids sg-xxxxxxxxx \
  --subnet-id subnet-xxxxxxxxx \
  --count 1 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=MyInstance}]'
```

### 3. Attach Security Groups

```bash
# Create security group
aws ec2 create-security-group \
  --group-name my-sg \
  --description "My security group" \
  --vpc-id vpc-xxxxxxxxx

# Set the instance's security groups (this replaces the current list,
# so include every group the instance should keep)
aws ec2 modify-instance-attribute \
  --instance-id i-xxxxxxxxx \
  --groups sg-xxxxxxxxx
```

### 4. Manage Security Group Rules

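```bash
# Allow SSH access (port 22) from a single trusted address.
# 203.0.113.25/32 is a documentation placeholder; use your own IP or range.
# Avoid 0.0.0.0/0 for SSH, which exposes port 22 to the whole internet.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.25/32

# Allow HTTP access (port 80) from anywhere (typical for a public web server)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxxx \
  --protocol tcp \
  --port 80 \
  --cidr 0.0.0.0/0

# Remove a rule (the parameters must match the rule being removed)
aws ec2 revoke-security-group-ingress \
  --group-id sg-xxxxxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.25/32
```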
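
Because opening port 22 to every address is a common mistake, a thin wrapper can refuse that case before calling the CLI. This is a sketch only: `allow_ssh_from` is a hypothetical function, and it handles IPv4 CIDR blocks passed through `--cidr`.

```bash
# Hypothetical wrapper: refuse to open SSH to the whole internet.
# Usage: allow_ssh_from <security-group-id> <ipv4-cidr>
allow_ssh_from() {
  case "$2" in
    0.0.0.0/0)
      echo "refusing to open port 22 to $2; pass a narrower CIDR" >&2
      return 1 ;;
  esac
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" --protocol tcp --port 22 --cidr "$2"
}

# Example call (placeholder values): allow_ssh_from sg-xxxxxxxxx 203.0.113.25/32
```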

### 5. Start / Stop / Reboot / Terminate Instances

```bash
# Start instance
aws ec2 start-instances --instance-ids i-xxxxxxxxx

# Stop instance
aws ec2 stop-instances --instance-ids i-xxxxxxxxx

# Reboot instance
aws ec2 reboot-instances --instance-ids i-xxxxxxxxx

# Terminate instance
aws ec2 terminate-instances --instance-ids i-xxxxxxxxx
```

### 6. Allocate and Attach Elastic IP

```bash
# Allocate Elastic IP
aws ec2 allocate-address --domain vpc

# Associate Elastic IP with instance
aws ec2 associate-address \
  --instance-id i-xxxxxxxxx \
  --allocation-id eipalloc-xxxxxxxxx

# Disassociate Elastic IP
aws ec2 disassociate-address \
  --association-id eipassoc-xxxxxxxxx

# Release Elastic IP
aws ec2 release-address \
  --allocation-id eipalloc-xxxxxxxxx
```

### 7. Describe Instances (Retrieve Metadata)

```bash
# List all instances
aws ec2 describe-instances

# Get specific instance details
aws ec2 describe-instances \
  --instance-ids i-xxxxxxxxx

# Get public IP address
aws ec2 describe-instances \
  --instance-ids i-xxxxxxxxx \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text

# List running instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,PublicIpAddress,Tags[?Key==`Name`].Value|[0]]' \
  --output table
```



# Scaling Strategies

Scaling is the process of adjusting compute capacity to meet demand. There are two fundamental approaches:

### Vertical Scaling (Scale Up / Scale Down)

**Definition**: Increase or decrease the CPU/memory of a **single** EC2 instance by changing its instance type.

#### How it Works:
```
t3.medium (2 vCPU, 4 GB RAM)
           ↓
    Stop instance
           ↓
  Change instance type
           ↓
m5.2xlarge (8 vCPU, 32 GB RAM)
```

#### Process:
1. **Stop the instance** (cannot change type while running)
2. **Change instance type** via console or CLI
3. **Start the instance** with new capacity

#### Characteristics:
- ✅ **Simple to implement** - just change instance type
- ✅ **No application changes** needed
- ❌ **Downtime required** during the change
- ❌ **Finite ceiling** - cannot scale beyond the largest instance type
- ❌ **Single point of failure** (not fault-tolerant)

#### Best Use Cases:
- Development/test environments
- Applications that cannot be distributed
- Databases (traditional RDBMS)
- Legacy applications
- Quick performance boost

#### Example:
```bash
# Stop instance
aws ec2 stop-instances --instance-ids i-xxxxxxxxx

# Wait for stopped state
aws ec2 wait instance-stopped --instance-ids i-xxxxxxxxx

# Change instance type (the attribute value is passed as JSON)
aws ec2 modify-instance-attribute \
  --instance-id i-xxxxxxxxx \
  --instance-type '{"Value": "m5.2xlarge"}'

# Start instance
aws ec2 start-instances --instance-ids i-xxxxxxxxx
```

### Horizontal Scaling (Scale Out / Scale In)

**Definition**: Add or remove **multiple** EC2 instances to handle load by distributing work across many machines.

#### How it Works:
```
Single Instance
     ↓
Add more instances
     ↓
3 Instances (cluster)
     ↓
Add more instances
     ↓
10 Instances (cluster)
```

#### Characteristics:
- ✅ **High availability** - failure of one instance doesn't break the system
- ✅ **Fault tolerance** - work is redistributed on failure
- ✅ **No downtime** during scaling
- ✅ **Virtually unlimited scaling** - add as many instances as needed (within account quotas)
- ✅ **Better cost optimization** - add/remove as needed
- ❌ **More complex** - requires load balancing and coordination
- ❌ **Application must support** distributed architecture

#### Components Needed:
1. **Load Balancer** - Distributes traffic across instances
2. **Auto Scaling Group** - Automatically adds/removes instances
3. **Cluster Manager** - Coordinates distributed work (e.g., YARN, Kubernetes)
4. **Shared Storage** - S3, EFS, or HDFS for data access
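
At its core, the decision an Auto Scaling Group makes is "how many instances does the current load need, within configured bounds". The sketch below shows that arithmetic in isolation. It is a simplified model, not the actual Auto Scaling algorithm; `desired_instances`, the request rates, and the bounds are all illustrative assumptions.

```bash
# Hypothetical sizing rule: enough instances to serve the load, within bounds.
# Usage: desired_instances <load> <capacity_per_instance> <min> <max>
desired_instances() {
  needed=$(( ($1 + $2 - 1) / $2 ))   # ceiling division
  if [ "$needed" -lt "$3" ]; then needed=$3; fi
  if [ "$needed" -gt "$4" ]; then needed=$4; fi
  echo "$needed"
}

# Assume each instance handles 1000 requests/s, with 2 to 10 instances allowed
desired_instances 4500 1000 2 10    # 5
desired_instances 300 1000 2 10     # 2  (held at the minimum)
desired_instances 50000 1000 2 10   # 10 (capped at the maximum)
```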

#### Best Use Cases:
- **Web applications** - handle variable traffic
- **Microservices** - independent scaling of services
- **Big data processing** - Hadoop, Spark clusters
- **Containerized applications** - ECS, EKS
- **Stateless applications** - no local state dependency

#### Used By:
- ✅ **Amazon EMR** - Managed Hadoop/Spark clusters
- ✅ **Amazon ECS/EKS** - Container orchestration
- ✅ **Spark clusters** - Distributed data processing
- ✅ **Hadoop clusters** - Distributed storage and compute
- ✅ **Web application clusters** - Behind load balancers

### Comparison Table:

| Aspect | Vertical Scaling | Horizontal Scaling |
|--------|-----------------|-------------------|
| **Approach** | Bigger machine | More machines |
| **Downtime** | Yes | No |
| **Complexity** | Low | High |
| **Fault Tolerance** | No | Yes |
| **Scaling Limit** | Instance size limit | Virtually unlimited |
| **Cost** | Pay for peak capacity even when idle | Capacity can follow demand |
| **Application Changes** | None required | May require changes |
| **Typical Use** | Databases, legacy apps | Web apps, big data |

### When to Use Each:

#### Use Vertical Scaling When:
- Application cannot be distributed
- Need quick performance improvement
- Simple architecture preferred
- Temporary capacity increase
- Development/testing

#### Use Horizontal Scaling When:
- Need high availability
- Traffic is variable/unpredictable
- Application is stateless or distributed
- Processing big data
- Long-term production workloads

### Hybrid Approach:

Many systems use **both strategies**:
1. **Vertically scale** to right-size individual instances
2. **Horizontally scale** by adding more optimally-sized instances

**Example**: 
- Start with m5.xlarge instances
- Horizontal scale: 5 instances → 10 instances (for more traffic)
- Vertical scale: m5.xlarge → m5.2xlarge (if each instance needs more power)
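
The combined effect of the two moves is easy to see in total vCPUs. In this sketch `fleet_vcpus` is a hypothetical helper; the per-instance figures are the published sizes of `m5.xlarge` (4 vCPUs) and `m5.2xlarge` (8 vCPUs).

```bash
# Total vCPUs of a fleet. Usage: fleet_vcpus <instance_count> <vcpus_per_instance>
fleet_vcpus() { echo $(( $1 * $2 )); }

echo "5  x m5.xlarge  = $(fleet_vcpus 5 4) vCPUs"    # starting point
echo "10 x m5.xlarge  = $(fleet_vcpus 10 4) vCPUs"   # after horizontal scaling
echo "10 x m5.2xlarge = $(fleet_vcpus 10 8) vCPUs"   # after vertical scaling as well
```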



# EC2 in Distributed Systems

### Key Concept: EC2 is the Building Block

**Important**: EC2 is not the system itself - it's the fundamental component used to build distributed systems and clusters.

```
Physical Servers
       ↓
   EC2 Instances (Virtual Machines)
       ↓
   Cluster/Distributed System
       ↓
Big Data Framework (Hadoop, Spark)
```

### How EC2 is Used in Clusters

#### Basic Principle:
- **Multiple EC2 instances form a cluster**
- Each instance plays a specific role
- Instances work together as a coordinated system

#### Instance Roles:

1. **Master / Driver / Coordinator**
   - Controls and coordinates the cluster
   - Distributes tasks to workers
   - Monitors cluster health
   - Examples:
     - Hadoop NameNode
     - Spark Driver
     - Kubernetes control plane node

2. **Worker / Executor / Slave**
   - Executes assigned tasks
   - Processes data
   - Reports back to master
   - Examples:
     - Hadoop DataNode
     - Spark Executor
     - Kubernetes Worker Node

### Why Clusters Are Needed

#### 1. Data is Too Large for a Single Machine
- **Problem**: 100 TB dataset won't fit on one machine
- **Solution**: Distribute data across many machines
- **Example**: HDFS splits data into blocks across DataNodes

#### 2. Workload Can Be Parallelized
- **Problem**: Processing takes too long on one machine
- **Solution**: Divide work among many machines
- **Example**: Spark distributes transformations across executors

#### 3. Failures Are Expected and Handled
- **Problem**: Hardware failures are inevitable at scale
- **Solution**: Replication and fault tolerance
- **Example**: HDFS keeps 3 copies of each data block by default
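
The first and third points combine into a quick sizing estimate for the 100 TB example. This is a back-of-the-envelope sketch: both helper names are hypothetical, the 6 TB of usable disk per node is an assumed figure, and it ignores headroom for temporary data and growth. The 128 MB block size is the HDFS default in Hadoop 2 and later.

```bash
# Hypothetical sizing sketch for the 100 TB example.
# Usage: hdfs_nodes_needed <dataset_tb> <replication> <usable_tb_per_node>
hdfs_nodes_needed() {
  raw_tb=$(( $1 * $2 ))                 # every block is stored <replication> times
  echo $(( (raw_tb + $3 - 1) / $3 ))    # ceiling division
}

# Number of blocks in a dataset. Usage: hdfs_block_count <dataset_tb> <block_mb>
hdfs_block_count() {
  echo $(( $1 * 1024 * 1024 / $2 ))
}

hdfs_nodes_needed 100 3 6    # 50 nodes for 300 TB of raw storage
hdfs_block_count 100 128     # 819200 blocks before replication
```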

### EC2 in Hadoop Ecosystem

#### Hadoop Cluster on EC2:

**Master Node (NameNode):**
- EC2 instance type: m5.xlarge or larger
- Role: Manages HDFS metadata
- Tracks which DataNode has which blocks

**Worker Nodes (DataNodes):**
- EC2 instance type: d2.xlarge (storage optimized)
- Role: Store HDFS data blocks
- Typically also run a YARN NodeManager, which executes MapReduce tasks

**Resource Manager:**
- EC2 instance: Can be on master or separate
- Role: YARN resource scheduling
- Allocates resources to applications

**Architecture**:
```
┌─────────────────────┐
│  Master EC2 Node    │
│  - NameNode         │
│  - ResourceManager  │
└──────────┬──────────┘
           │
    ┌──────┼────────┬──── ...
    │      │        │
┌───▼──┐ ┌─▼────┐ ┌─▼────┐
│Worker│ │Worker│ │Worker│
│EC2   │ │EC2   │ │EC2   │  ...
│Node1 │ │Node2 │ │Node3 │
└──────┘ └──────┘ └──────┘
```

### EC2 in Spark Ecosystem

#### Spark Cluster on EC2:

**Driver Node:**
- EC2 instance: m5.2xlarge
- Role: Spark application coordinator
- Creates execution plan
- Monitors executors

**Executor Nodes:**
- EC2 instances: r5.xlarge (memory optimized)
- Role: Execute Spark tasks
- Process data partitions
- Cache data in memory

**Architecture**:
```
┌─────────────────────┐
│  Driver EC2 Node    │
│  - Spark Driver     │
│  - Web UI (4040)    │
└──────────┬──────────┘
           │
    ┌──────┴───┬──────────┐
    │          │          │
┌───▼────┐ ┌───▼────┐ ┌───▼────┐
│Executor│ │Executor│ │Executor│
│EC2     │ │EC2     │ │EC2     │
│Cores:4 │ │Cores:4 │ │Cores:4 │
│Mem:16GB│ │Mem:16GB│ │Mem:16GB│
└────────┘ └────────┘ └────────┘
```

### Amazon EMR (Elastic MapReduce)

**What is EMR?**
- **Managed Hadoop/Spark service** on AWS
- Automatically provisions and configures EC2 clusters
- Formula: **EMR = EC2 + Cluster Manager + Hadoop/Spark**

#### EMR Components:

1. **EC2 Instances**
   - Virtual compute resources (the cluster's nodes)
   - You choose instance types

2. **Cluster Manager**
   - AWS manages cluster lifecycle
   - Handles instance failures
   - Scales cluster automatically

3. **Big Data Frameworks**
   - Pre-installed Hadoop, Spark, Hive, Presto
   - Configured and optimized by AWS

#### EMR Node Types:

**Primary Node** (formerly Master):
- m5.xlarge
- Runs NameNode, ResourceManager
- Coordinates cluster

**Core Nodes**:
- m5.xlarge, r5.xlarge
- Run DataNode, NodeManager
- Store HDFS data
- Removing them risks HDFS data loss

**Task Nodes** (optional):
- c5.xlarge
- Run NodeManager only
- No HDFS storage
- Can be added/removed (Spot instances work well)

### AWS Glue (Serverless Alternative)

**What is Glue?**
- **Serverless ETL service**
- No EC2 management required
- AWS provisions resources automatically

#### Glue vs EMR:

| Aspect | EMR (EC2-based) | Glue (Serverless) |
|--------|-----------------|-------------------|
| **EC2 Management** | You manage | AWS manages |
| **Cluster Setup** | Manual | Automatic |
| **Scaling** | Manual/Auto-scaling | Fully automatic |
| **Cost** | Pay for instances (plus an EMR fee) | Pay per DPU-hour while jobs run |
| **Control** | Full control | Limited control |
| **Best For** | Complex clusters | Simple ETL jobs |

### Other EC2 Cluster Examples

#### 1. Apache Airflow
- **Orchestration tool** for data pipelines
- **Scheduler Node**: EC2 instance running Airflow scheduler
- **Web Server Node**: EC2 instance for Airflow UI
- **Worker Nodes**: EC2 instances executing tasks
- **Database**: RDS or EC2 with PostgreSQL

#### 2. Kubernetes on EC2 (Amazon EKS)
- **Control Plane**: Managed by AWS with EKS (or run on EC2 yourself with self-managed Kubernetes)
- **Worker Nodes**: EC2 instances running containers
- **Pods**: Containerized applications
- Horizontal scaling via adding EC2 worker nodes

#### 3. Cassandra Cluster
- **Multiple EC2 nodes** (no single master)
- **Instance type**: i3.2xlarge (storage optimized)
- **Replication**: Data copied across nodes
- **Horizontal scaling**: Add more nodes

### Key Takeaways

1. **EC2 provides the infrastructure** - it's the foundation
2. **Software frameworks** (Hadoop, Spark) run on top of EC2
3. **Clusters distribute work** across multiple EC2 instances
4. **Each instance has a role** (master/worker)
5. **Managed services** (EMR, EKS) automate EC2 cluster management
6. **Serverless alternatives** (Glue) eliminate EC2 management entirely



# Summary

✅ **Virtual Machine fundamentals** and how EC2 provides VMs in AWS  
✅ **EC2 instance types** optimized for different workloads  
✅ **AMIs** as templates for creating instances  
✅ **Key pairs** for secure authentication  
✅ **Cost management** and billing considerations  
✅ **Multiple connection methods** (SSH, Instance Connect, Session Manager)  
✅ **Instance lifecycle** and state transitions  
✅ **Shell basics** for managing EC2 instances  
✅ **CLI commands** for EC2 management  
✅ **Scaling strategies** (vertical vs horizontal)  
✅ **EC2's role in distributed systems** like Hadoop and Spark  
✅ **Practice questions** and shell exercises



# Quick Reference Commands

## EC2 Management
```bash
# Launch instance
aws ec2 run-instances --image-id <ami> --instance-type t3.micro --key-name <key>

# List instances
aws ec2 describe-instances

# Start instance
aws ec2 start-instances --instance-ids <id>

# Stop instance
aws ec2 stop-instances --instance-ids <id>

# Terminate instance
aws ec2 terminate-instances --instance-ids <id>
```

## SSH Connection
```bash
# Connect with key
ssh -i key.pem ec2-user@<public-ip>

# Set key permissions
chmod 400 key.pem
```

## Shell Basics
```bash
pwd                    # Current directory
ls -la                 # List files
mkdir dir_name         # Create directory
cd dir_name            # Change directory
rm -rf dir_name        # Remove directory
cat file.txt           # View file
```

