# C1_W2_Assignment

## 🔍 Goal of the Assignment

In this assignment, you act as a **Data Engineer** at a retail company that sells scale models of classic cars. Your job is to build a simple batch data pipeline that:

1. **Extracts** customer and sales data from an Amazon RDS (MySQL) database,  
2. **Transforms** it into a star schema for analytical querying, and  
3. **Loads** it into an Amazon S3 bucket.

The transformed data can then be queried by analysts using **Amazon Athena**.

---

## 📌 What Was Provided

- A running Amazon RDS instance containing the classic car database (relational database).
- The ETL script that transforms the raw data into a star schema.
- Pre-written Terraform scripts to provision the AWS infrastructure (S3 buckets, Glue jobs, etc.).
- A Jupyter Notebook to query the final output.

---

## 🛠️ Architectural Diagram

The image below gives a high-level overview of what you are building:

![Architectural Diagram](../image/archi_dia.png)

### 🧩 Components:

- **Amazon RDS**: Source MySQL database with tables like `products`, `orders`, `customers`, etc.
- **AWS Glue ETL**: Extracts raw data from RDS and transforms it into a star schema.
- **Amazon S3**: Target storage where the transformed data is loaded.
- **Glue Crawler**: Scans the data in S3, infers its schema, and updates the AWS Glue Data Catalog so the tables can be queried.  
  🔗 Learn more: [AWS Glue Data Catalog Docs](https://docs.aws.amazon.com/glue/latest/dg/components-overview.html)
- **Amazon Athena**: SQL-based querying engine to analyze data in S3.  
  🔗 Learn more: [What is Amazon Athena?](https://aws.amazon.com/athena/)  
  ✳️ **Note:** Athena is a serverless query engine, not a data warehouse. It runs standard SQL directly against data in S3, with no infrastructure to manage. Compared with a dedicated warehouse such as Amazon Redshift, it can be slower for very large analytical jobs.
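
To make the Athena step more concrete, here is a minimal sketch of how a query request could be assembled in Python. It is not the notebook's actual code. The database name `example_db` and the results location `s3://example-bucket/athena-results/` are hypothetical placeholders, and the helper `build_athena_request` is introduced only for illustration. The keys it returns match the parameters of boto3's `start_query_execution` call.

```python
def build_athena_request(sql, database, output_location):
    """Return keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        # The Glue Data Catalog database that the crawler populated.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to an S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    sql="SELECT * FROM fact_orders LIMIT 10",
    database="example_db",  # hypothetical database name
    output_location="s3://example-bucket/athena-results/",  # hypothetical bucket
)
print(request)

# With boto3 installed and AWS credentials configured, the request could be
# submitted like this (left as a comment so the sketch runs offline):
#   import boto3
#   athena = boto3.client("athena")
#   response = athena.start_query_execution(**request)
#   print(response["QueryExecutionId"])
```

The helper only builds the request, so it can be inspected without an AWS account. Submitting it and polling for results is a separate step.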

---

## 🔄 Pipeline Scenario

The diagram below shows the original relational structure of your MySQL RDS database.

![Pipeline Scenario](../image/pipeline_scenario.png)

- You are tasked with extracting only the data that the analyst needs.
- You will transform this data into a format that is **easier and faster to query**.
- You will load this data into **S3** for long-term storage and access.

---

## 🔃 ETL Process

This is a classic **ETL** (Extract-Transform-Load) pipeline:

- **Extract**: Pull data from the source (RDS)
- **Transform**: Restructure it using a **star schema**
- **Load**: Save transformed data into an S3 bucket

The Glue job automates this process.
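
The three steps above can be sketched in plain Python. This is a local illustration only, not the provided Glue script: an in-memory SQLite database stands in for the RDS MySQL source, a temporary folder stands in for the S3 bucket, and the simplified table columns, the `customer_key` name, and all rows are assumptions made for the example.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def make_source():
    """Stand-in for the RDS source: in-memory SQLite with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customerNumber INTEGER, customerName TEXT, country TEXT);
        CREATE TABLE orders (orderNumber INTEGER, customerNumber INTEGER,
                             quantityOrdered INTEGER, priceEach REAL);
        INSERT INTO customers VALUES (1, 'Alpha Models', 'France'), (2, 'Beta Toys', 'USA');
        INSERT INTO orders VALUES (100, 1, 2, 50.0), (101, 2, 1, 80.0), (102, 1, 3, 10.0);
        """
    )
    return conn


def extract(conn):
    """Extract: read the raw rows from the source tables."""
    customers = conn.execute(
        "SELECT customerNumber, customerName, country FROM customers ORDER BY customerNumber"
    ).fetchall()
    orders = conn.execute(
        "SELECT orderNumber, customerNumber, quantityOrdered, priceEach "
        "FROM orders ORDER BY orderNumber"
    ).fetchall()
    return customers, orders


def transform(customers, orders):
    """Transform: reshape the rows into one dimension table and one fact table."""
    dim_customers = [
        {"customer_key": num, "customerName": name, "country": country}
        for num, name, country in customers
    ]
    fact_orders = [
        {"orderNumber": o, "customer_key": c, "quantityOrdered": q, "orderAmount": q * p}
        for o, c, q, p in orders
    ]
    return {"dim_customers": dim_customers, "fact_orders": fact_orders}


def load(tables, target_dir):
    """Load: write each table to a CSV file (the folder stands in for S3)."""
    written = []
    for name, rows in tables.items():
        path = Path(target_dir) / f"{name}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written


def run_pipeline(target_dir):
    """Run extract, transform, and load in order."""
    conn = make_source()
    try:
        tables = transform(*extract(conn))
    finally:
        conn.close()
    return tables, load(tables, target_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tables, files = run_pipeline(tmp)
        print(sorted(p.name for p in files))
        print(tables["fact_orders"][0])
```

Keeping each step in its own function mirrors the ETL split: the source, the reshaping logic, and the target can each be swapped without touching the other two.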

---

## 🌟 Structure of Transformed Data

The following image shows the **Star Schema** used to restructure the relational data.

![Star Schema](../image/star_schema.png)

### 🔷 What is a Star Schema?

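A **star schema** is a data model optimized for analytical queries over large datasets. It consists of a central **fact table**, which stores the measurable events, linked by keys to multiple **dimension tables**, which store the descriptive attributes used to filter and group those events.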

🔗 Learn more: [Star Schema Explained](https://vertabelo.com/blog/data-warehouse-modeling-the-star-schema/)

### 🟨 Fact Table:
- `fact_orders`: central table with metrics like `quantityOrdered`, `orderAmount`, etc.

### 🟦 Dimension Tables:
- `dim_customers`, `dim_products`, `dim_locations`: provide context for analytical aggregations (e.g., sales by product, sales by country)
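
As a rough illustration of why this layout suits aggregations, the sketch below builds a toy version of the schema in an in-memory SQLite database and computes sales by country and by product. The key columns (`customer_key`, `product_key`, `location_key`), the attribute columns, and all rows are assumptions made for the example; the real tables produced by the Glue job may look different.

```python
import sqlite3

# Toy star schema. Key names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customers (customer_key INTEGER PRIMARY KEY, customerName TEXT);
    CREATE TABLE dim_products  (product_key  INTEGER PRIMARY KEY, productName  TEXT);
    CREATE TABLE dim_locations (location_key INTEGER PRIMARY KEY, country      TEXT);

    -- The fact table holds the metrics plus one key per dimension.
    CREATE TABLE fact_orders (
        customer_key    INTEGER REFERENCES dim_customers(customer_key),
        product_key     INTEGER REFERENCES dim_products(product_key),
        location_key    INTEGER REFERENCES dim_locations(location_key),
        quantityOrdered INTEGER,
        orderAmount     REAL
    );

    INSERT INTO dim_customers VALUES (1, 'Alpha Models'), (2, 'Beta Toys');
    INSERT INTO dim_products  VALUES (1, 'Model Car A'), (2, 'Model Car B');
    INSERT INTO dim_locations VALUES (1, 'France'), (2, 'USA');
    INSERT INTO fact_orders VALUES
        (1, 1, 1, 2, 100.0),
        (2, 2, 2, 1,  80.0),
        (1, 2, 1, 3,  30.0);
    """
)


def sales_by_country(conn):
    """Total order amount per country: one join from fact to dimension."""
    return conn.execute(
        """
        SELECT l.country, SUM(f.orderAmount) AS total_sales
        FROM fact_orders AS f
        JOIN dim_locations AS l ON l.location_key = f.location_key
        GROUP BY l.country
        ORDER BY total_sales DESC
        """
    ).fetchall()


def sales_by_product(conn):
    """Total order amount per product, using the same join pattern."""
    return conn.execute(
        """
        SELECT p.productName, SUM(f.orderAmount) AS total_sales
        FROM fact_orders AS f
        JOIN dim_products AS p ON p.product_key = f.product_key
        GROUP BY p.productName
        ORDER BY total_sales DESC
        """
    ).fetchall()


print(sales_by_country(conn))  # [('France', 130.0), ('USA', 80.0)]
print(sales_by_product(conn))  # [('Model Car B', 110.0), ('Model Car A', 100.0)]
```

Each question needs only one join from the fact table to a single dimension, which is the main practical benefit over querying the normalized source tables.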

---

## 🔧 How This Was Built — Terraform

All infrastructure (Glue Jobs, Crawlers, S3 buckets) was created using **Terraform**, which is an Infrastructure as Code (IaC) tool.

![Terraform Setup](../image/terrafomr.png)

The Terraform code was already provided. In **Course 2**, you’ll learn how to write these scripts yourself.

---

## 💬 Notes for Future Reference

- The code and infrastructure setup are **already provided**. You are not expected to write them from scratch for now.
- **Glue scripts, Lambda functions, and Terraform code** might look complex now — this is normal.
- Future courses will cover these in detail.