# Table of Contents
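- [Introduction to DataOps](#introduction-to-dataops)
- [DataOps Automation](#dataops-automation)
- [Infrastructure as Code (IaC)](#infrastructure-as-code-iac)
- [Terraform setting up EC2 Instance](#terraform-setting-up-ec2-instance)
- [Using Variables and Outputs in Terraform](#using-variables-and-outputs-in-terraform)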

# Introduction to DataOps

**DataOps** is a collaborative data management practice focused on improving the communication, integration, and automation of data flows between data managers and data consumers across an organization.  
It aims to deliver **faster, more reliable, and higher-quality** data analytics through the adoption of agile development, DevOps practices, and lean manufacturing principles.

---

## 🏛 The Three Pillars of DataOps

1. **Automation**  
   - Streamlines repetitive processes using tools and scripts.
   - Enables rapid, reliable, and scalable data pipeline deployments.
   
2. **Observability & Monitoring**  
   - Provides visibility into pipeline performance, health, and data quality.
   - Uses monitoring tools and alerts to proactively detect issues.
   
3. **Incident Response**  
   - Defines structured processes for identifying, triaging, and resolving data issues quickly.
   - Reduces downtime and minimizes business impact.

---

## 📌 Pillars Diagram

![DataOps Pillars](./images/dataops_pillers.png)

---

## ⚙️ Automation in DataOps

Automation is a core pillar of DataOps, enabling teams to reduce manual intervention, minimize human errors, and accelerate delivery.

### **Key Automation Practices**
1. **Continuous Integration and Continuous Delivery (CI/CD)**  
   - Automates the process of building, testing, integrating, and deploying data pipelines.
   
2. **Infrastructure as Code (IaC)**  
   - Uses code to define, provision, and manage infrastructure resources.
   - Ensures reproducibility and scalability.

---

## 🛠 Tools for Automation

Some popular tools used for automation in DataOps include:

- **Terraform** – For provisioning and managing infrastructure as code.  
- **Ansible** – For configuration management and deployment automation.  
- **Jenkins / GitHub Actions / GitLab CI** – For CI/CD pipeline automation.  
- **Apache Airflow / Prefect** – For orchestrating data workflows.

---

## ⚙️ Automation Diagram

![Automation](./images/automation.png)

# DataOps Automation

## 📌 What is DataOps Automation?
**DataOps Automation** is the practice of streamlining and automating every stage of a **data pipeline** — from ingestion to transformation to delivery — in order to:
- Reduce manual intervention.
- Improve reliability and consistency.
- Enable faster and safer deployments.
- Integrate best practices from **DevOps** into data engineering workflows.

It borrows concepts from **DevOps automation** like:
- **CI/CD (Continuous Integration / Continuous Delivery)**
- **Version Control**
- **Infrastructure as Code (IaC)**
- **Orchestration with DAGs** (Directed Acyclic Graphs)

---

## 🛠 Levels of DataOps Automation

### 1️⃣ No Automation
- All processes are run **manually** by engineers.
- Time-consuming, prone to human error, and difficult to scale.

![Manual Process](./images/manul.png)

---

### 2️⃣ Pure Scheduling (Semi-Automation)
- Each stage of the pipeline runs on a **fixed schedule**.
- Improves consistency, but lacks dynamic triggers and dependency management.

![Semi Automation](./images/semi_automatio.png)

---

### 3️⃣ Fully Automated with Orchestration (e.g., Apache Airflow)
- Pipelines are defined as a **Directed Acyclic Graph (DAG)**.
- Orchestration tools like **Apache Airflow** ensure tasks run in the right order, only when dependencies are met.
- Enables retries, error handling, and monitoring.

![Fully Automated](./images/fully_autmated_via_apache_airflow.png)

---

## 🔄 CI/CD in DataOps
**Continuous Integration / Continuous Delivery** automates:
1. **Build** – Prepare code and configurations.
2. **Test** – Automatic review and testing of new code or data transformations.
3. **Integrate** – Merge tested changes into the main pipeline.
4. **Deploy** – Automatic delivery into production.

This approach ensures rapid, reliable updates to both **code and data**.

![CI/CD](./images/ci-cd.png)

---

## 💻 Infrastructure as Code (IaC)
- Maintain infrastructure configurations as **code**.
- Example: Provisioning cloud storage, compute resources, and databases through code files.
- Benefits:
  - Version control for infrastructure.
  - Reproducibility.
  - Easy rollback to previous setups.

![Infrastructure as Code](./images/iac.png)

---

## 📂 Version Control for Code & Data
- Tracks **changes** in both:
  - **Pipeline code** (SQL, Python, configs).
  - **Data versions** moving through the pipeline.
- Enables rollback to **previous versions** in case of errors.

![Version Control](./images/version_control.png)

---

## 🚀 Why DataOps Automation Matters
- **Consistency** – Fewer errors and more predictable results.
- **Speed** – Faster deployments and updates.
- **Scalability** – Handle large, complex pipelines without bottlenecks.
- **Resilience** – Automatic error handling, monitoring, and quick rollbacks.

In short, DataOps Automation ensures that **data pipelines run like a well-oiled factory line** — continuously delivering trusted data products at high speed and with minimal manual effort.

# Infrastructure as Code (IaC)

## 📌 What is Infrastructure as Code?
**Infrastructure as Code (IaC)** is the practice of defining, deploying, and maintaining infrastructure using **code**, rather than manual processes.  
With IaC, you can automate the creation of **networking, security, computing, storage**, and other resources required for your cloud-based data pipelines.

Benefits include:
- **Automation** – Reduce manual effort and human error.
- **Scalability** – Deploy infrastructure for large, complex systems quickly.
- **Consistency** – Ensure all environments match the desired configuration.
- **Version Control** – Track changes and roll back when necessary.

---

## 🕰 History of IaC

![History of IaC](./images/iac_history.png)

1. **1970s – Configuration Management**
   - Engineers used shell scripts (written for the Unix shells that preceded **Bash**, which itself arrived in 1989) to automate configuration of physical machines.
   - Primitive automation for repetitive setup tasks.

2. **2006 – AWS EC2 Launch**
   - Cloud computing became widely accessible.
   - Developers could **spin up virtual servers on demand**.

3. **2010s – Modern IaC Tools**
   - Tools like **Terraform**, **AWS CloudFormation**, and **Ansible** emerged.
   - Enabled full infrastructure provisioning via code files instead of manual configuration.

---

## ⚙️ How Terraform Works

Terraform is a **cloud-agnostic** IaC tool created by **HashiCorp**.  
It uses **HCL (HashiCorp Configuration Language)**, a **declarative** language, to define the desired end state of infrastructure.

### Example: S3 Bucket Setup in Terraform

![Terraform S3 Config](./images/terraform_s3_config.png)

**Key points in Terraform syntax:**
- `resource` → Keyword indicating the type of entity you want to create.
- `"aws_s3_bucket"` → **Resource type** (provider + service).
- `"data_lake"` → **Resource name** (your internal reference).
- `{ ... }` → **Configuration block** with key-value pairs.
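
Because the configuration above appears only as a screenshot, here is a minimal text sketch of what such a block might look like. The bucket name and tag are placeholders chosen for illustration, not values taken from the image.

```hcl
# Sketch only: the bucket name and tag below are hypothetical placeholders.
resource "aws_s3_bucket" "data_lake" {
  bucket = "example-data-lake-bucket" # S3 bucket names must be globally unique

  tags = {
    Purpose = "data-lake"
  }
}
```

Elsewhere in the configuration this bucket would be referenced as `aws_s3_bucket.data_lake`, which is why the resource name only needs to be unique within your own code.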

---

## 🖋 Example: Creating a VPC and EC2 Instance

![VPC & Instance](./images/vpc_instance_iac.png)
```hcl
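# Look up a recent Ubuntu 22.04 AMI so the instance below has a valid image.
# Assumption: the owner ID and name pattern are Canonical's published values.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

# VPC creation
resource "aws_vpc" "main" {
  cidr_block       = "10.0.0.0/16"
  instance_tenancy = "default"

  tags = {
    Name = "main"
  }
}

# EC2 instance creation
# Note: with no subnet_id set, the instance launches in the default VPC,
# not in the VPC defined above.
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  tags = {
    Name = "HelloWorld"
  }
}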
```
---

## 🔄 Terraform Idempotency

![Idempotency](./images/idempotent.png)

Terraform ensures **idempotency** — running the same configuration multiple times **won’t recreate resources unnecessarily**:
- If the resource **does not exist**, Terraform creates it.
- If it **exists but differs** from the desired state, Terraform updates it.
- If it **matches exactly**, Terraform does nothing.
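
The three cases can be traced with a single resource. This is a sketch under assumptions: the bucket name and tag are hypothetical, and the plan summaries in the comments reflect recent Terraform 1.x releases, so the exact wording may differ in other versions.

```hcl
# Hypothetical resource used to walk through the three idempotency cases.
resource "aws_s3_bucket" "data_lake" {
  bucket = "example-data-lake-bucket" # placeholder name

  tags = {
    Environment = "dev"
  }
}

# Case 1: first `terraform apply`, the bucket does not exist yet.
#   Plan: 1 to add, 0 to change, 0 to destroy.
#
# Case 2: `terraform apply` again with no edits, the bucket already matches.
#   No changes. Your infrastructure matches the configuration.
#
# Case 3: after editing Environment to "prod", the bucket exists but differs.
#   Plan: 0 to add, 1 to change, 0 to destroy.
```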

---

## 🆚 Bash vs Terraform
- **Bash scripts** are **imperative** → they execute commands in a fixed order and do not check the existing state unless you write those checks yourself.
- Running the same Bash provisioning script twice can create **duplicate resources** or fail with "already exists" errors.
- **Terraform** is **declarative** → it compares the desired state with the current state and makes only the necessary changes.
- This makes Terraform **safe, repeatable, and idempotent**.

---

# Terraform setting up EC2 Instance

---

## 🧭 What We’ll Build (Big Picture)

![Region + Default VPC](./images/setting_up_ec2.png)

We’ll create a tiny EC2 instance in **`us-east-1`** (N. Virginia) inside the **default VPC**. This is great for learning; in production you’d use a custom VPC.

---

## 🧩 Providers & Plugins — What They Are (and Why You Need Them)

![Provider anatomy](./images/provider.png)

**Terraform Core** understands `.tf` files and figures out a plan, but it **doesn’t know vendor APIs** by itself.  
That’s the job of a **provider plugin**:

- A **provider** (in Terraform terms) is a **binary plugin** that knows how to talk to an external system’s API (AWS, GCP, GitHub, Datadog, etc.).
- The provider implements CRUD operations for **resources** (e.g., `aws_instance`, `aws_s3_bucket`).
- When you run `terraform init`, Terraform **downloads the provider plugin** you declared in your config (from the Terraform Registry) and stores it under `.terraform/`.

**Two steps you always do:**
1. **Declare** that you need a provider (so Terraform downloads the plugin).
2. **Configure** the provider (region, credentials/profile, etc.).

---

## 🧰 Install & Prepare Terraform

![Install Terraform](./images/terraform_intall.png)

1. **Install Terraform CLI**
   - macOS (Homebrew): `brew tap hashicorp/tap && brew install hashicorp/tap/terraform`
   - Windows: Download the zip from HashiCorp and add `terraform.exe` to your `PATH`, or use `choco install terraform`.
   - Linux: Use your distro’s package manager or download the binary from HashiCorp.

2. **Set up AWS credentials** (pick one)
   - **AWS CLI profile**:  
     `aws configure --profile myprofile`
   - **Environment variables**:  
     `export AWS_ACCESS_KEY_ID=...`  
     `export AWS_SECRET_ACCESS_KEY=...`  
     (plus `AWS_SESSION_TOKEN` when you use temporary credentials from SSO/STS)
   - **EC2 role** (if running Terraform **on** an EC2 instance)

3. **Create a new project folder**
   ```bash
   mkdir tf-ec2-lab && cd tf-ec2-lab
   ```

---

## 🔁 Terraform Workflow You’ll Use

![Terraform Steps](./images/terraform_steps.png)

1. **Write** `.tf` files  
2. **Init** → downloads provider plugin(s)  
3. **Plan** → shows the execution plan  
4. **Apply** → creates/updates/destroys resources

---

## 🧱 Code Anatomy: Blocks, Labels, Arguments

![Code blocks in editor](./images/terraform_code.png)
![Labels & types](./images/resourse__.png)

- **Block**: `keyword "label1" "label2" { ... }`
- **Examples**: `terraform {}`, `provider "aws" {}`, `resource "aws_instance" "web" {}`  
- **Arguments** inside `{}` are key–value pairs or nested blocks.
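
The annotated sketch below labels each part of a block. The AMI ID is a deliberate placeholder and the disk size is an arbitrary illustrative value, so treat this as a syntax reference rather than something to apply as-is.

```hcl
# <block type> "<label 1: resource type>" "<label 2: local name>" { <body> }
resource "aws_instance" "web" {
  # Arguments take the form name = value
  ami           = "ami-00000000000000000" # placeholder, not a real AMI ID
  instance_type = "t3.micro"

  # A map-valued argument (note the "=")
  tags = {
    Name = "web"
  }

  # A nested block (no "="), which has arguments of its own
  root_block_device {
    volume_size = 20 # GiB, illustrative value
  }
}
```

Blocks such as `terraform {}` take no labels, `provider "aws" {}` takes one, and `resource` takes two; the block type determines how many labels are expected.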

---

## 1) **Declare** the AWS Provider (so Terraform can download the plugin)

This lives in `main.tf`. It tells Terraform **which provider plugin** to get and which Terraform CLI versions are allowed.

```hcl
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"   # Download the AWS provider plugin from the Registry
      version = "~> 5.0"          # Stay within major version 5 for compatibility
    }
  }
}
```

**What each part does**
- `required_providers.aws.source` → Registry address for the **plugin**.
- `required_providers.aws.version` → Version range for stability.
- `required_version` → Guardrail for your local Terraform CLI version.

---

## 2) **Configure** the AWS Provider (region & auth settings)

This block configures the **runtime** for the provider plugin (e.g., which **region** to call).

```hcl
provider "aws" {
  region  = "us-east-1"   # Target AWS region for API calls
  # profile = "myprofile" # Optional: use a specific AWS CLI profile
}
```

- If you omit `profile`, Terraform uses the default credential chain:
  environment variables → shared credentials files → SSO/role → EC2 role, etc.
- You can keep `region` as a variable if you prefer (shown below).

![Provider block callouts](./images/terraform_provider.png)

---

## 3) **Create** an EC2 Instance (Simple Version)

![Resource type](./images/resourse.png)

At minimum you need an `ami` and an `instance_type`.

```hcl
resource "aws_instance" "webserver" {
  ami           = "ami-0453ec754f44f9a4a"  # Replace with a valid AMI ID in your region
  instance_type = "t2.micro"

  tags = {
    Name = "ExampleServer"
  }
}
```

- `resource "aws_instance" "webserver"`: resource **type** and local **name**.
- `ami`: Operating system image.
- `instance_type`: Size (CPU/RAM).
- `tags`: helpful labels.

---

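## 3b) **(Safer)** Look Up the Latest Amazon Linux 2 AMI Dynamically

This avoids hard‑coded AMI IDs. The `aws_instance` block below replaces the one from step 3; keep only one `webserver` resource in your configuration.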

```hcl
data "aws_ami" "al2" {
  most_recent = true

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  owners = ["137112412989"] # Amazon
}

resource "aws_instance" "webserver" {
  ami           = data.aws_ami.al2.id
  instance_type = "t2.micro"

  tags = {
    Name = "ExampleServer"
  }
}
```

**Why this is better**
- The **data source** only **reads** info (does not create resources).
- You get the most recent AMI matching the filter in whichever region the provider is configured for.

---



## 🧪 Run It

1) Initialize (downloads the **AWS provider plugin**)
```bash
terraform init
```

2) Preview what will happen
```bash
terraform plan
```

3) Create resources (you’ll be prompted to type `yes`)
```bash
terraform apply
```

4) Clean up later
```bash
terraform destroy
```

---

## 🧠 Extra Notes & Troubleshooting

- **Credentials**: If `plan/apply` fails with “no valid credentials,” set env vars or specify `profile` in the provider block.
- **Region mismatch**: AMI IDs are region‑specific. Use the **data source** approach to avoid mismatches.
- **Idempotency**: Terraform is **declarative** and **idempotent**—re‑running `apply` won’t duplicate resources if the real world already matches your config.




# Using Variables and Outputs in Terraform

![Hard-coded values example](./images/hard_code.png)

In the earlier config, values like **AWS region** and **EC2 instance name** were **hard-coded**. Replacing them with **variables** makes the code reusable, and exposing key attributes as **outputs** lets you print or pass data to other stacks.

---

## 1) Declare input variables

Create `variables.tf` and describe the inputs your config needs.

```hcl
variable "region" {
  description = "AWS region to deploy resources"
  type        = string
  default     = "us-east-1"
}

variable "serverName" {
  description = "Name tag for the EC2 instance"
  type        = string
}
```

**What this does**

- `variable "<name>"` defines an input.
- `description` documents it.
- `type` constrains values.
- `default` is optional. If it is omitted and no value is supplied another way, Terraform prompts for one when you run `plan` or `apply`.

---

## 2) Use variables in your config

Wire those inputs into your provider and resources by referencing `var.<name>`.

```hcl
# providers.tf
provider "aws" {
  region = var.region
}

# main.tf
resource "aws_instance" "webserver" {
  ami           = "ami-0453ec754f44f9a4a"
  instance_type = "t2.micro"

  tags = {
    Name = var.serverName
  }
}
```

**What this does**

- `var.region` replaces the hard-coded region.
- `var.serverName` sets the `Name` tag dynamically.
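
The same pattern extends to other hard-coded values. As a sketch, the instance size could be turned into an input too; the `instance_type` variable here is a hypothetical addition, not part of the original config.

```hcl
# variables.tf (hypothetical addition)
variable "instance_type" {
  description = "EC2 instance size"
  type        = string
  default     = "t2.micro" # same size the config used when it was hard-coded
}

# main.tf: the same resource as above, with the size read from the variable
resource "aws_instance" "webserver" {
  ami           = "ami-0453ec754f44f9a4a"
  instance_type = var.instance_type

  tags = {
    Name = var.serverName
  }
}
```

Giving the variable a default keeps existing runs working unchanged while still allowing an override per environment.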

---

## 3) Set variable values

You can pass values at apply time or via a `.tfvars` file.

```bash
# Option A: CLI flag
terraform apply -var="serverName=ExampleServer"

# Option B: terraform.tfvars (auto-loaded)
```

```hcl
# terraform.tfvars
serverName = "ExampleServer"
```

**What this does**

- CLI `-var` sets a one-off value.
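- `terraform.tfvars` (and any `*.auto.tfvars` file) is loaded automatically, which makes it handy for sharing values across a team. Other `.tfvars` files must be passed explicitly with `-var-file`.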
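
For per-environment values, a separately named variable file is one option. The file name and values below are hypothetical; because the name does not match the auto-loaded patterns, it would be passed explicitly, for example with `terraform apply -var-file="dev.tfvars"`.

```hcl
# dev.tfvars (hypothetical file name; not loaded automatically)
region     = "us-east-1"
serverName = "DevServer"
```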

---

## 4) Declare outputs

Expose useful attributes (like the instance **ID** and **ARN**) so you can print or reuse them.

```hcl
# outputs.tf
output "server_id" {
  description = "The ID of the EC2 instance"
  value       = aws_instance.webserver.id
}

output "server_arn" {
  description = "The ARN of the EC2 instance"
  value       = aws_instance.webserver.arn
}
```

**What this does**

- `output "<name>"` defines a value to return after `apply`.
- `value` reads an attribute from a resource using
  `resource_type.resource_name.attribute`.
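
Other attributes can be exposed the same way. As a sketch, the block below adds the instance's public IP address; the output name `server_public_ip` is a hypothetical choice.

```hcl
# outputs.tf (hypothetical addition)
output "server_public_ip" {
  description = "Public IPv4 address of the EC2 instance (empty if none is assigned)"
  value       = aws_instance.webserver.public_ip
}
```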

---

## 5) Apply and read outputs

```bash
# Create or update infra
terraform apply

# Show all outputs
terraform output

# Show a single output
terraform output server_id
```

**What this does**

- `apply` evaluates variables, creates/updates resources, then prints outputs.
- `terraform output` queries stored output values anytime after an apply.

---

## 6) File layout (recommended)
Terraform loads every `.tf` file in a directory and treats them as a single configuration, so you can safely split your code across several files to make it easier to navigate. Keep things tidy by splitting concerns:

```
.
├─ main.tf         # resources
├─ providers.tf    # provider + terraform blocks
├─ variables.tf    # inputs
├─ outputs.tf      # outputs
└─ terraform.tfvars # values for inputs (exclude from VCS if it holds secrets)
```

This organization scales as your workspace grows. Variables make configs **portable**, and outputs make results **discoverable** across modules/workspaces.