# Cloud & DevOps for Data Engineering: Exercise Results



## 1. Docker Practice
- Create a Dockerfile to containerize a basic Python script (e.g., prints "Hello, Docker!").
- Build the Docker image using your Dockerfile.
- Run a container from your image and confirm that the expected output is displayed.
- Briefly explain how containerization benefits data engineering workflows.


#### Answer 


```python
# script.py
print("Hello, Docker!")
```


```dockerfile
# Dockerfile
FROM python:3.9
COPY script.py /app/script.py
WORKDIR /app
CMD ["python", "script.py"]
```



**Build the Docker image:**
```
docker build -t myapp .
```

**Run the container:**
```
docker run myapp
```
*Expected output:*
```
Hello, Docker!
```



**How containerization benefits data engineering workflows:**

- **Consistency:** Ensures that the application runs identically across different environments (development, testing, production).
- **Isolation:** Separates dependencies and configurations, preventing conflicts between projects.
- **Portability:** Makes it easy to move workloads between local machines, cloud, or on-premises environments.
- **Scalability:** Facilitates deploying and scaling services quickly in orchestration platforms like Kubernetes.


---
## 2. CI/CD Pipeline
- Build a GitHub Actions workflow that automatically runs your project’s test suite on every push and pull request.
- The workflow should use a suitable environment (e.g., Python, Node, etc.), install dependencies, and execute your tests.
- Ensure that failed tests prevent merging, promoting code quality through continuous integration.


```yaml
# .github/workflows/ci.yml
name: CI
on:
  push:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: pytest

```
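The `pytest` step is only useful when the repository contains tests for it to collect. The following is a minimal sketch of such a test module. The file name and the `greeting()` function are hypothetical and defined inline to keep the example self-contained; in a real project the function would be imported from the package under test.

```python
# tests/test_greeting.py (hypothetical file name)


def greeting(name: str = "Docker") -> str:
    """Hypothetical function under test; normally imported from the project."""
    return f"Hello, {name}!"


def test_default_greeting():
    # pytest collects functions whose names start with `test_`
    assert greeting() == "Hello, Docker!"


def test_custom_greeting():
    assert greeting("CI") == "Hello, CI!"
```

`pytest` exits with a non-zero status when any assertion fails, which marks the `test` job as failed. For a failed job to actually block merging, the job also has to be configured as a required status check in the repository's branch protection settings.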

---
## 3. Serverless Function (AWS Lambda)
- Develop a Python AWS Lambda handler that processes a simple event (e.g., returns a greeting or echoes input).
- Demonstrate how to test the Lambda function locally using AWS SAM CLI (or a similar tool).
- Explain the key components of the handler (event, context) and best practices for local testing before deploying to AWS.


```python
def handler(event, context):
    return {'statusCode': 200, 'body': 'Hello from Lambda!'}
```
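The handler receives two arguments: `event`, the invocation payload (a dict when the input is JSON), and `context`, a runtime object that exposes metadata such as the request ID and the remaining execution time. Because the handler is an ordinary Python function, a quick first check is to call it directly before involving any tooling. The sketch below assumes a hypothetical `name` field in the event and a hand-written `FakeContext` stand-in; neither comes from the original exercise.

```python
from dataclasses import dataclass


def handler(event, context):
    """Return a greeting; `name` is a hypothetical event field used for illustration."""
    name = (event or {}).get("name", "Lambda")
    return {"statusCode": 200, "body": f"Hello from {name}!"}


@dataclass
class FakeContext:
    """Minimal stand-in for the Lambda context object (the real one has more attributes)."""

    function_name: str = "local-test"
    aws_request_id: str = "00000000-0000-0000-0000-000000000000"

    def get_remaining_time_in_millis(self) -> int:
        # Fixed value; the real runtime counts down towards the function timeout
        return 30_000


if __name__ == "__main__":
    # Invoke the handler directly with sample events
    print(handler({"name": "SAM"}, FakeContext()))
    print(handler({}, FakeContext()))
```

With AWS SAM CLI, the closer-to-production equivalent is `sam local invoke <FunctionLogicalId> --event event.json`, which runs the handler inside a Docker container that resembles the Lambda runtime. This assumes a SAM template that declares the function and an `event.json` file holding the sample payload.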


---
## 4. Infrastructure as Code (IaC)

- Use Terraform to define and provision cloud resources in a declarative way.
- Task: Write a Terraform configuration file that creates an S3 bucket in AWS.
- Use a test AWS account and choose a globally unique bucket name.
- This exercise helps you practice infrastructure automation and version control for cloud resources.


#### Terraform configuration to create an S3 bucket in AWS

```hcl
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "test_bucket" {
  bucket = "dataeng-iac-test-bucket-20240615" # Replace with a globally unique name

  # Tags are set with the `tags` argument on the bucket resource itself.
  tags = {
    Environment = "Test"
    Purpose     = "IaC Exercise"
  }
}

# New S3 buckets are private by default. The inline `acl` argument is
# deprecated in AWS provider v4 and later, so it is omitted here.
```


---

### Challenge
- Implement cloud-based monitoring for your data pipeline.
- Choose either AWS CloudWatch or GCP Stackdriver for this task.
- Set up logging to capture pipeline events and errors.
- Configure at least one alert/notification (alarm) for pipeline failures or anomalies.
- Briefly document your setup and explain how you would use these tools to troubleshoot and ensure pipeline reliability.


#### Answer

##### Cloud-Based Monitoring with AWS CloudWatch

**1. Logging Pipeline Events and Errors**

- Integrate Python's `logging` library with CloudWatch Logs using the `watchtower` library:


```python
import logging

import watchtower

logger = logging.getLogger("etl_pipeline")
logger.setLevel(logging.INFO)
# Recent watchtower releases name this argument `log_group_name`;
# older releases used `log_group`.
logger.addHandler(watchtower.CloudWatchLogHandler(log_group_name="etl-pipeline-logs"))


def extract():
    logger.info("Starting extract step")
    try:
        # extraction logic goes here
        logger.info("Extract step completed")
    except Exception:
        # logger.exception logs at ERROR level and includes the stack trace
        logger.exception("Extract error")
        raise
```



**2. Configuring CloudWatch Alarm for Pipeline Failures**

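- Set up a metric filter on the `etl-pipeline-logs` log group in CloudWatch Logs:
    - Pattern: `{ $.levelname = "ERROR" }` (this JSON pattern only matches log events emitted as JSON with a `levelname` field; for plain-text log lines, a term pattern such as `ERROR` is the usual alternative)
    - Metric namespace: `ETLPipeline`
    - Metric name: `ErrorCount`
- Create a CloudWatch Alarm on that metric:
    - Threshold: Trigger when the sum of `ErrorCount` is at least 1 within a 5-minute period
    - Action: Send a notification to an SNS topic (email/SMS)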
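The same filter and alarm can be created from code instead of the console. The sketch below builds the request parameters for the boto3 `put_metric_filter` and `put_metric_alarm` calls. The filter name, alarm name, and the `configure_monitoring` helper are hypothetical, and the clients are passed in so the parameter-building logic can be checked without AWS credentials.

```python
def build_metric_filter(log_group: str) -> dict:
    """Parameters for logs.put_metric_filter: count ERROR-level JSON log events."""
    return {
        "logGroupName": log_group,
        "filterName": "etl-error-filter",  # hypothetical name
        "filterPattern": '{ $.levelname = "ERROR" }',
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": "ETLPipeline",
                "metricValue": "1",  # each matching event adds 1
                "defaultValue": 0.0,  # report 0 when nothing matches
            }
        ],
    }


def build_alarm(sns_topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_metric_alarm: >= 1 error in 5 minutes."""
    return {
        "AlarmName": "etl-pipeline-errors",  # hypothetical name
        "Namespace": "ETLPipeline",
        "MetricName": "ErrorCount",
        "Statistic": "Sum",
        "Period": 300,  # seconds, i.e. the 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def configure_monitoring(logs_client, cloudwatch_client, log_group, sns_topic_arn):
    """Create the metric filter, then the alarm that watches its metric."""
    logs_client.put_metric_filter(**build_metric_filter(log_group))
    cloudwatch_client.put_metric_alarm(**build_alarm(sns_topic_arn))
```

Against a real account, this would be called as `configure_monitoring(boto3.client("logs"), boto3.client("cloudwatch"), "etl-pipeline-logs", topic_arn)`, where `topic_arn` is the ARN of an existing SNS topic.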



**3. Brief Documentation**

- All pipeline logs (info, errors) are streamed to a dedicated CloudWatch log group.
- A metric filter counts error-level events as they are ingested.
- An alarm notifies the data engineering team through SNS when the error count crosses the threshold.
- **Troubleshooting:** Engineers use log search and filtering in CloudWatch Logs to diagnose issues, trace error stack traces, and confirm resolution post-remediation.
- **Reliability:** Automated alerting ensures rapid response to failures, minimizing pipeline downtime and improving operational reliability.
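For the troubleshooting step, the log search can also be scripted. The following is a rough sketch that pulls recent error messages with the boto3 `filter_log_events` call and follows `nextToken` pagination. The `recent_errors` helper is hypothetical, and the plain `ERROR` term pattern is an assumption about how the log lines are formatted.

```python
import time


def recent_errors(logs_client, log_group: str, minutes: int = 60) -> list[str]:
    """Return messages of log events matching ERROR from the last `minutes` minutes."""
    start_ms = int((time.time() - minutes * 60) * 1000)  # API expects epoch milliseconds
    kwargs = {
        "logGroupName": log_group,
        "filterPattern": "ERROR",  # assumed term pattern; adjust to the log format
        "startTime": start_ms,
    }
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token  # request the next page of results
```

A call such as `recent_errors(boto3.client("logs"), "etl-pipeline-logs", minutes=30)` would list the errors behind a recent alarm, which can then be compared against the pipeline's run history.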

---

*You may use GCP Stackdriver (Cloud Logging/Monitoring) similarly: export logs, set up log-based metrics, and configure alerting policies for error events.*