# Practice Questions

## Conceptual Questions

#### 1. What is an EC2 instance in simple terms?

**Answer**: An EC2 instance is a virtual machine (VM) running on AWS's physical servers. It's a software-based computer that you can configure with your choice of operating system, CPU, memory, storage, and networking. You use EC2 instances to run applications, host websites, process data, or perform any computing task in the cloud.

**Key points**:
- Virtual computer in the cloud
- Fully configurable (OS, size, storage)
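- Pay-as-you-go: compute is billed only while the instance is running
- Can be started, stopped, or terminated on demand (a stopped instance still incurs charges for its attached EBS storage)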

#### 2. What is the difference between vertical scaling and horizontal scaling?

**Answer**:

**Vertical Scaling (Scale Up/Down)**:
- **Changing the size of a single instance**
- Example: t3.medium → m5.2xlarge
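- Requires stopping the instance to change its type, so expect a short outage
- Limited by the largest instance size AWS offers
- Simpler to operate, but the single instance remains a single point of failure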
- **Analogy**: Upgrading your computer to a more powerful one

**Horizontal Scaling (Scale Out/In)**:
- **Adding or removing multiple instances**
- Example: 1 instance → 10 instances
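- Instances can be added or removed without stopping the ones already serving traffic
- Scales far beyond any single instance size (subject to account quotas and application design)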
- Provides fault tolerance and high availability
- More complex (needs load balancer, cluster manager)
- **Analogy**: Hiring more workers instead of training one super-worker

**Example**:
- Vertical: One database server with 64 GB RAM → 128 GB RAM
- Horizontal: One web server → Five web servers behind a load balancer
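
To make the "requires stopping the instance" point concrete, here is a minimal sketch of a vertical resize using the AWS CLI. The instance ID and target type are placeholders, and the `run` helper is a hypothetical wrapper added for this example: by default it only prints each command, so nothing is changed unless you set `DRY_RUN=0` with valid AWS credentials.

```bash
#!/usr/bin/env bash
# Sketch: vertically scale one EBS-backed instance by changing its type.
# INSTANCE_ID and NEW_TYPE are placeholder values, not real resources.
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"
NEW_TYPE="${NEW_TYPE:-m5.2xlarge}"

# Hypothetical helper: print the command when DRY_RUN=1 (default), else run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

resize_instance() {
  # The instance type can only be changed while the instance is stopped.
  run aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
  run aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
  run aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" \
      --instance-type "Value=$NEW_TYPE"
  run aws ec2 start-instances --instance-ids "$INSTANCE_ID"
}

resize_instance
```

The stop/wait/modify/start sequence is where the downtime of vertical scaling comes from; horizontal scaling avoids it by launching additional instances instead.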

#### 3. Why do big data systems prefer horizontal scaling?

**Answer**: Big data systems prefer horizontal scaling because:

1. **Data Volume Exceeds Single Machine Capacity**
   - A single machine cannot store petabytes of data
   - HDFS distributes data across many machines
   - Example: 1 PB dataset split across 100 nodes

2. **Parallel Processing**
   - Data processing can be parallelized
   - Spark/Hadoop divide work across executors
   - 10 machines give up to 10x the throughput in the ideal case; coordination and data shuffling reduce the real gain

3. **Fault Tolerance**
   - Hardware failures are inevitable at scale
   - Data replication protects against node failure
   - System continues working if nodes fail

4. **Cost Effectiveness**
   - Cheaper to use many commodity servers
   - More flexible than buying massive single servers
   - Can add capacity incrementally

5. **No Single Point of Failure**
   - Distributed architecture ensures availability
   - One node failure doesn't break the system

6. **Near-Linear Scalability**
   - Adding nodes increases capacity roughly in proportion, for workloads that partition well
   - Vertical scaling hits a ceiling
   - Can grow to thousands of nodes

**Example**:
- Hadoop cluster: 100 nodes, each with 10 TB = 1 PB of raw capacity (usable capacity is lower once replicas are counted)
- If one node fails, only about 1% of the stored blocks lose a copy, and HDFS restores them from replicas on other nodes
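
The capacity figures in this example can be checked with a few lines of shell arithmetic. This is a rough sketch: the replication factor of 3 is an assumption (it is the HDFS default for `dfs.replication`), since the example itself does not state one.

```bash
#!/usr/bin/env bash
# Back-of-the-envelope capacity check for the example cluster.
nodes=100
tb_per_node=10
replication=3   # assumption: HDFS default replication factor

raw_tb=$((nodes * tb_per_node))        # total disk across the cluster
usable_tb=$((raw_tb / replication))    # integer division, rounds down
share_per_node=$((100 / nodes))        # percent of raw capacity on one node

echo "Raw capacity:    ${raw_tb} TB"
echo "Usable capacity: ${usable_tb} TB (replication factor ${replication})"
echo "Share per node:  ${share_per_node}% of raw capacity"
```

With these inputs the script reports 1000 TB raw, 333 TB usable, and 1% per node, which is why losing a single node touches only a small slice of the data.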

#### 4. If you terminate an EC2 instance, why might AWS still charge you?

**Answer**: Even after terminating an EC2 instance, you may still be charged for:

1. **EBS Volumes** (if not deleted with instance)
   - Root and additional volumes may persist
   - Check "Delete on Termination" setting
   - You're charged for storage (GB/month)

2. **Elastic IP Addresses** (if not released)
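   - Since February 2024, AWS bills every public IPv4 address by the hour, including Elastic IPs, whether or not they are attached to a running instance
   - An Elastic IP stays allocated to your account after the instance is terminated; release it explicitly to stop the charge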

3. **EBS Snapshots**
   - Snapshots created before termination remain
   - Stored in Amazon-managed S3 storage (not visible in your own buckets), charged per GB-month
   - Must manually delete snapshots

4. **AMIs** (Custom Amazon Machine Images)
   - If you created AMIs from the instance
   - The EBS snapshots backing the AMI are billed until you deregister the AMI and delete those snapshots

5. **Load Balancers**
   - If instance was registered with an ELB
   - Load balancer continues running and charging

6. **NAT Gateways**
   - If instance was using NAT Gateway for internet access
   - NAT Gateway charges continue independently

7. **Data Transfer Costs**
   - Data transferred out of AWS before termination is still billed
   - The charge appears on the invoice at the end of the billing period, after the instance is gone

**Best Practice**:
- Review all associated resources before termination
- Use AWS Cost Explorer to identify ongoing charges
- Set up billing alerts
- Tag resources for easier tracking
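
As a starting point for that review, the sketch below lists the resource types named in this answer with the AWS CLI. The `run` helper is a hypothetical wrapper for this example: with the default `DRY_RUN=1` it only prints the commands, so you can read them first and then rerun with `DRY_RUN=0` once credentials and a region are configured.

```bash
#!/usr/bin/env bash
# Sketch: list resources that can keep billing after an instance is terminated.

# Hypothetical helper: print the command when DRY_RUN=1 (default), else run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

audit_leftovers() {
  # EBS volumes that are not attached to any instance
  run aws ec2 describe-volumes --filters Name=status,Values=available
  # Elastic IPs allocated to the account
  run aws ec2 describe-addresses
  # Snapshots and AMIs owned by this account
  run aws ec2 describe-snapshots --owner-ids self
  run aws ec2 describe-images --owners self
  # Application/Network Load Balancers and NAT gateways
  run aws elbv2 describe-load-balancers
  run aws ec2 describe-nat-gateways
}

audit_leftovers
```

These commands only read state; deleting or releasing anything is a separate, deliberate step.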

## Coding Question - Shell Commands

**Scenario**: You have just connected to an EC2 instance using SSH.

**Task**:
1. Check your current working directory
2. List all files in the current directory
3. Create a new directory named `data_eng`
4. Move into the `data_eng` directory

**Solution**:

```bash
# 1. Check current working directory
pwd
# Output: /home/ec2-user on Amazon Linux (e.g. /home/ubuntu on Ubuntu AMIs)

# 2. List all files in current directory
ls
# Or for detailed listing with hidden files:
ls -la

# 3. Create new directory named data_eng
mkdir data_eng

# 4. Move into the data_eng directory
cd data_eng

# Verify you're in the correct directory
pwd
# Output: /home/ec2-user/data_eng
```

**Alternative - All commands in sequence**:
```bash
pwd && ls -la && mkdir data_eng && cd data_eng && pwd
```

**Explanation**:

- **`pwd`** (Print Working Directory): Shows your current location in the file system
- **`ls`**: Lists files and directories in current location
  - `ls -la`: Shows all files including hidden (.) files with detailed info
- **`mkdir`**: Creates a new directory
- **`cd`**: Changes directory (navigates to a different folder)
- **`&&`**: Chains commands - executes next command only if previous succeeds

**Output you would see**:
```bash
[ec2-user@ip-172-31-0-1 ~]$ pwd
/home/ec2-user

[ec2-user@ip-172-31-0-1 ~]$ ls
documents  scripts

[ec2-user@ip-172-31-0-1 ~]$ mkdir data_eng

[ec2-user@ip-172-31-0-1 ~]$ cd data_eng

[ec2-user@ip-172-31-0-1 data_eng]$ pwd
/home/ec2-user/data_eng
```
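
One caveat about the `&&` chain shown earlier: a plain `mkdir data_eng` fails if the directory already exists, so running the chain a second time stops before `cd`. The sketch below demonstrates this in a scratch directory (created with `mktemp -d`, an addition for this example so your home directory is left untouched) and shows `mkdir -p` as a rerunnable alternative.

```bash
#!/usr/bin/env bash
# Demonstrate && short-circuiting and the effect of mkdir -p.
workdir="$(mktemp -d)"   # scratch directory for the demo
cd "$workdir" || exit 1

# First run: the directory does not exist yet, so mkdir succeeds
mkdir data_eng && echo "first mkdir succeeded"

# Second run: mkdir fails, so the command after && is skipped
mkdir data_eng 2>/dev/null && echo "this line is not printed"

# mkdir -p succeeds whether or not the directory exists, so the chain continues
mkdir -p data_eng && cd data_eng && pwd
```

The final `pwd` prints a path ending in `/data_eng`, confirming the chain ran to completion.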

## Additional Practice Exercises

### Exercise 1: Basic File Operations
```bash
# Create a file
touch test.txt

# Write content to file
echo "Hello EC2" > test.txt

# Display file content
cat test.txt

# Append to file
echo "Data Engineering" >> test.txt

# View file
cat test.txt
```
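
The difference between `>` and `>>` is easy to miss, so here is a small sketch that counts lines after each step. It writes to a temporary file from `mktemp` (an assumption for this example) instead of `test.txt`, so it can be run repeatedly without affecting the exercise file.

```bash
#!/usr/bin/env bash
# Compare overwrite (>) with append (>>).
demo_file="$(mktemp)"   # scratch file; the path is chosen by mktemp

echo "Hello EC2" > "$demo_file"           # > creates or truncates, then writes
echo "Data Engineering" >> "$demo_file"   # >> adds a line at the end
lines_after_append="$(wc -l < "$demo_file" | tr -d ' ')"

echo "Replaced" > "$demo_file"            # > again: the earlier lines are gone
lines_after_overwrite="$(wc -l < "$demo_file" | tr -d ' ')"

echo "Lines after append:    $lines_after_append"
echo "Lines after overwrite: $lines_after_overwrite"
```

The counts are 2 after the append and 1 after the second `>`, which is why `>>` is the safer choice when adding to logs or data files.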

### Exercise 2: Working with Multiple Directories
```bash
# Create nested directories
mkdir -p projects/data/raw

# Navigate to nested directory
cd projects/data/raw

# Create a file there
touch dataset.csv

# Go back to home
cd ~

# List directory tree
tree projects  # (if tree is installed)
# Or
ls -R projects
```

### Exercise 3: Installing Software
```bash
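# Update packages
sudo yum update -y    # Amazon Linux / Red Hat (applies available updates)
# or
sudo apt update       # Ubuntu (refreshes the package index only)

# Install Python 3 and pip
sudo yum install -y python3 python3-pip    # Amazon Linux / Red Hat
# or
sudo apt install -y python3 python3-pip    # Ubuntu

# Verify installation
python3 --version
pip3 --version

# Install a Python package for the current user
# (recent Ubuntu releases may refuse this outside a virtual environment)
pip3 install --user pandas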
```
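
Because the right install command depends on the distribution, a script can check which package manager is present before choosing one. The following is a minimal sketch: `detect_pkg_manager` is a hypothetical helper name, and the script only prints the command it would suggest instead of running it.

```bash
#!/usr/bin/env bash
# Sketch: pick an install command based on the package manager that is available.

# Hypothetical helper: prints dnf, yum, apt-get, or unknown.
detect_pkg_manager() {
  if command -v dnf >/dev/null 2>&1; then
    echo "dnf"
  elif command -v yum >/dev/null 2>&1; then
    echo "yum"
  elif command -v apt-get >/dev/null 2>&1; then
    echo "apt-get"
  else
    echo "unknown"
  fi
}

pkg="$(detect_pkg_manager)"
echo "Detected package manager: $pkg"

case "$pkg" in
  dnf|yum)
    echo "Suggested: sudo $pkg install -y python3 python3-pip" ;;
  apt-get)
    echo "Suggested: sudo apt-get update && sudo apt-get install -y python3 python3-pip" ;;
  *)
    echo "No supported package manager found" ;;
esac
```

Checking with `command -v` avoids hard-coding the distribution name, which helps when the same setup script is reused across different AMIs.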

### Exercise 4: Checking System Resources
```bash
# Check disk space
df -h

# Check memory usage
free -h

# Check CPU information
lscpu

# Check running processes
top  # Press 'q' to quit

# Check network interfaces
ip addr
# or (legacy tool; may require installing the net-tools package)
ifconfig
```
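
When you check the same things on every login, it can help to wrap the non-interactive commands in one function. This is a small sketch under a few assumptions: `resource_report` is a hypothetical name, `top` is left out because it is interactive by default, and `free` and `nproc` are guarded because they are not present on every system.

```bash
#!/usr/bin/env bash
# Sketch: print a one-shot summary of disk, memory, and CPU.
resource_report() {
  echo "== Host: $(uname -n) =="

  echo "== Disk usage for / =="
  df -h /

  # free and nproc are standard on most Linux AMIs but not guaranteed everywhere
  if command -v free >/dev/null 2>&1; then
    echo "== Memory =="
    free -h
  fi
  if command -v nproc >/dev/null 2>&1; then
    echo "== CPU cores: $(nproc) =="
  fi
}

resource_report
```

Redirecting the output to a file (for example `resource_report > report.txt`) gives a snapshot you can compare before and after resizing an instance.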