# 📊 Week 1 — Exploratory Data Analysis (EDA): Reading Guide

**Learning Objectives (Week 1 – EDA)**  
- Understand the motivation for MLOps and how EDA fits into a production ML lifecycle.  
- Connect to Redshift and perform reproducible EDA.  
- Document data quality issues and define target/feature schema.  
- Prepare train/validation/test splits with leakage-aware methodology.  

> **Context**: ZAP is targeting **MLOps Level 2**. Even EDA should be reproducible and versioned (data query, sampling, and preprocessing code committed).

## 🔍 What is EDA and Why It Matters
Exploratory Data Analysis (EDA) is the process of **exploring, visualizing, and validating datasets** before training models.  
In **MLOps**, EDA is about much more than plots — it’s about **data reliability** and ensuring downstream pipelines are stable.

**Why it matters for production:**
- 🗑️ **Garbage in, garbage out** → poor data = poor models.  
- ⚡ **Operational resilience** → detect defects early, before they hit production.  
- 🔁 **Pipeline reliability** → schemas and checks from EDA become the foundation for automation.  


## 📐 Data Quality Dimensions
Checking data quality ensures your model won’t collapse when facing real-world inputs. Here are the key dimensions:

| Dimension    | Question to Ask | Example Issue |
|--------------|-----------------|---------------|
| ✅ Completeness | Are required values present? | Missing customer age |
| 🔄 Consistency | Do values follow expected formats/relations? | Country code "PT" inconsistently mapped |
| 🎯 Accuracy | Are values correct? | Negative product price |
| 🧩 Validity | Do values conform to rules/types? | Dates stored as free-text |
| ⏱️ Timeliness | Is the data up to date? | Using last year’s sales for today’s forecast |
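
As a rough illustration, three of these dimensions can be checked with a few lines of plain Python. This is a minimal sketch, not a prescribed implementation: the sample rows and the column names (`customer_id`, `age`, `country`, `price`) are made up for the example, and in a notebook the same checks would typically be written with a DataFrame library.

```python
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"customer_id": 1, "age": 34, "country": "PT", "price": 19.9},
    {"customer_id": 2, "age": None, "country": "pt", "price": 5.0},
    {"customer_id": 3, "age": 51, "country": "Portugal", "price": -3.0},
]


def missing_counts(rows):
    """Completeness: number of None values per column."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            counts[col] += value is None
    return dict(counts)


def case_variants(rows, col):
    """Consistency: raw values that collapse to one key once case and whitespace are normalized."""
    groups = {}
    for row in rows:
        value = row[col]
        if value is not None:
            groups.setdefault(value.strip().lower(), set()).add(value)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


def negative_rows(rows, col, id_col="customer_id"):
    """Accuracy: ids of rows where a quantity that should be non-negative is below zero."""
    return [row[id_col] for row in rows if row[col] is not None and row[col] < 0]


print(missing_counts(rows))            # age has one missing value
print(case_variants(rows, "country"))  # "PT" and "pt" refer to the same code
print(negative_rows(rows, "price"))    # row 3 has a negative price
```

Note that normalizing case does not catch `"Portugal"` versus `"PT"`; mappings like that need an explicit lookup table.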


## ⚠️ Leakage and Target Contamination
- **Data leakage** → using information not available at prediction time.  
- **Target contamination** → when the target leaks into features or data splits.  

❌ Example leakage: Using "credit approval status" as a feature to predict loan approval.  
❌ Example contamination: Randomly splitting time-series data, letting future events “leak” into training.

➡️ Both lead to inflated metrics **during training and offline evaluation** and catastrophic failures **in production**.
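
For time-ordered data, one common way to avoid the contamination example above is to split on a date cutoff instead of at random. The sketch below assumes a hypothetical `event_date` column and an arbitrary cutoff; both are illustrative choices, not part of the course material.

```python
from datetime import date

# Hypothetical events; the ids, dates, and cutoff are made up for the example.
events = [
    {"id": 1, "event_date": date(2024, 1, 5)},
    {"id": 2, "event_date": date(2024, 3, 20)},
    {"id": 3, "event_date": date(2024, 2, 11)},
    {"id": 4, "event_date": date(2024, 4, 2)},
]


def time_split(rows, date_col, cutoff):
    """Rows strictly before the cutoff go to train, the rest go to test."""
    train = [r for r in rows if r[date_col] < cutoff]
    test = [r for r in rows if r[date_col] >= cutoff]
    return train, test


train, test = time_split(events, "event_date", cutoff=date(2024, 3, 1))

# Leakage check: every training row must be older than every test row.
assert max(r["event_date"] for r in train) < min(r["event_date"] for r in test)

print([r["id"] for r in train], [r["id"] for r in test])
```

A random shuffle of the same rows could put the April event in train and the January event in test, which is exactly the "future leaks into training" situation.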


## ♻️ Reproducibility
Reproducibility = **same results given same inputs**. Essential for trust, debugging, and collaboration.

Key practices:
- 🎲 **Fixed seeds** → ensure reproducible sampling/splitting.  
- 📑 **Deterministic queries** → e.g., always `ORDER BY` a unique key such as `id` in SQL, so that row order (and any `LIMIT`-based sample) is stable.  
- 🖥️ **Environment capture** → record Python & library versions, OS, hardware.  

Without reproducibility → experiments can’t be compared, bugs can’t be traced.
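
Two of these practices, fixed seeds and environment capture, can be sketched with the standard library alone. The seed value, the function names, and the package list are assumptions made for illustration; adapt them to whatever the project actually uses.

```python
import json
import platform
import random
import sys
from importlib import metadata

SEED = 42  # arbitrary value; what matters is that it is fixed and committed


def sample_ids(ids, k, seed=SEED):
    """Draw k ids with a local RNG so the sample does not depend on global random state."""
    rng = random.Random(seed)
    return rng.sample(sorted(ids), k)  # sorting first makes the result independent of input order


def capture_environment(packages=("pandas", "numpy")):
    """Record interpreter, OS, and package versions; packages that are not installed map to None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


ids = range(1000)
assert sample_ids(ids, 5) == sample_ids(ids, 5)  # same seed, same sample

print(json.dumps(capture_environment(), indent=2))
```

Saving the printed JSON next to the EDA outputs gives later readers a record of the environment the results came from.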


## 📦 Outputs That Feed the Pipeline
EDA is not a one-off. Its **outputs become artifacts** for the ML pipeline:

- 🗂️ **Feature schema** → defines types, ranges, categories, nullability.  
- ✅ **Data checks** → rules like “no nulls in IDs” or “target is binary.”  
- ✂️ **Split strategy** → deterministic, leakage-free train/val/test partitions.  

These artifacts support:
- Automation in CI/CD ✅  
- Monitoring in production 📈  
- MLOps Level 2 maturity ⚙️  
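
To make the idea of EDA artifacts concrete, here is a minimal sketch that infers a feature schema from observed rows and reuses it as a data check. The column names, the schema layout, and the helper names `infer_schema` and `violations` are hypothetical; ranges inferred from a sample are only a starting point and should be reviewed by hand before they are committed.

```python
import json

# Hypothetical rows; None marks a missing value.
rows = [
    {"customer_id": 1, "age": 34, "plan": "basic"},
    {"customer_id": 2, "age": None, "plan": "premium"},
    {"customer_id": 3, "age": 51, "plan": "basic"},
]


def infer_schema(rows):
    """Derive type, nullability, and range or categories per column from observed rows."""
    schema = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]
        entry = {"nullable": len(present) < len(values)}
        if not present:
            entry["type"] = "unknown"  # nothing observed, so nothing can be inferred
        elif all(isinstance(v, (int, float)) for v in present):
            entry.update(type="number", min=min(present), max=max(present))
        else:
            entry.update(type="category", categories=sorted(set(present)))
        schema[col] = entry
    return schema


def violations(row, schema):
    """Return human-readable rule violations for one incoming row."""
    problems = []
    for col, rule in schema.items():
        value = row.get(col)
        if value is None:
            if not rule["nullable"]:
                problems.append(f"{col}: null not allowed")
        elif rule["type"] == "number" and not rule["min"] <= value <= rule["max"]:
            problems.append(f"{col}: {value} outside [{rule['min']}, {rule['max']}]")
        elif rule["type"] == "category" and value not in rule["categories"]:
            problems.append(f"{col}: unexpected category {value!r}")
    return problems


schema = infer_schema(rows)
print(json.dumps(schema, indent=2))  # this JSON is the artifact to version
print(violations({"customer_id": None, "age": 40, "plan": "gold"}, schema))
```

The same `schema` dictionary serves both purposes named above: serialized, it is the feature schema; applied to new rows, it is a data check that can run in CI or in production monitoring.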

---

# 📝 Exercises - Build the Dataset

Choose any dataset available on Redshift to practice EDA and to gather the information you need to train your model.  
Ideally, the dataset contains customer demographics, services, account information, and similar attributes.


## 🔧 Setup
Use the `load_data()` function provided in the `data_io.py` snippet to create a dataset, either from a `parquet` file in an S3 bucket or directly from Redshift.

In [None]:
#TODO

## 1. Data Overview & Metadata
Inspect the dataset:
- Number of rows and columns.  
- Data types of each column.  
- Identify the categorical, numerical, and target columns.  

In [None]:
#TODO

## 2. Data Quality Checks
After identifying the tables you want to work with, a crucial step is to analyze their data quality.
Check the **5 quality dimensions** on this dataset:

| Dimension    | Task |
|--------------|------|
| ✅ Completeness | Count missing/null values in each column. |
| 🔄 Consistency | Look for inconsistent categories (e.g., “Male” vs. “male”). |
| 🎯 Accuracy | Spot anomalies (e.g., negative charges). |
| 🧩 Validity | Check that business rules hold, e.g. `TotalProfit ≈ n_units × unitary_profit`. |
| ⏱️ Timeliness | Discuss whether tenure captures the freshness of the data. |
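
The validity rule in the table can be checked with a tolerance-based comparison, since exact equality on floating-point amounts is fragile. This is a sketch under assumptions: the `order_id` column and the tolerance values are invented for the example, and only the rule itself comes from the table.

```python
import math

# Hypothetical rows; column names follow the rule in the table, order_id is made up.
rows = [
    {"order_id": "A", "n_units": 3, "unitary_profit": 2.50, "TotalProfit": 7.50},
    {"order_id": "B", "n_units": 2, "unitary_profit": 1.10, "TotalProfit": 2.20},
    {"order_id": "C", "n_units": 4, "unitary_profit": 0.75, "TotalProfit": 5.00},
]


def profit_mismatches(rows, rel_tol=1e-6, abs_tol=0.01):
    """Return ids of rows where TotalProfit is not approximately n_units * unitary_profit."""
    bad = []
    for r in rows:
        expected = r["n_units"] * r["unitary_profit"]
        if not math.isclose(r["TotalProfit"], expected, rel_tol=rel_tol, abs_tol=abs_tol):
            bad.append(r["order_id"])
    return bad


print(profit_mismatches(rows))  # row C: 4 * 0.75 = 3.0, but TotalProfit is 5.0
```

The absolute tolerance of `0.01` assumes amounts are rounded to cents; pick a value that matches the precision of the real columns.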

In [None]:
#TODO

## 3. Target Variable Exploration
- Plot the distribution of the target variable.  
- Discuss if the dataset is **imbalanced** and what that implies for modeling.
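
Before plotting, it can help to quantify the imbalance directly. The sketch below uses a made-up binary target and a hypothetical helper `class_shares`; the labels and counts are illustrative only.

```python
from collections import Counter

# Hypothetical binary target with 8 negatives and 2 positives.
target = ["no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"]


def class_shares(labels):
    """Share of each class, most frequent first."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.most_common()}


counts = Counter(target)
imbalance_ratio = max(counts.values()) / min(counts.values())

print(class_shares(target))
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

With a split like this one, a model that always predicts the majority class already reaches 80% accuracy, which is why accuracy alone is a weak metric on imbalanced targets.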

In [None]:
#TODO

## 4. Univariate Analysis
- For numerical columns:  
  - Plot histograms & boxplots.  
  - Identify outliers and skewed distributions.  

- For categorical columns:  
  - Plot bar charts of category counts.  
  - Check if categories have enough representation.
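
As a numerical complement to the plots, outliers and skew can be flagged with the interquartile range and a mean/median comparison. This is a minimal sketch on invented values; the `1.5 × IQR` fence is the usual boxplot convention, not a rule from this guide.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], the same fences a boxplot draws."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical monthly charges with one extreme value.
charges = [20, 22, 21, 23, 25, 24, 22, 21, 300]

print(iqr_outliers(charges))
# A mean far above the median is a quick hint of right skew.
print(statistics.mean(charges), statistics.median(charges))
```

Whether a flagged value is an error or a legitimate extreme case is a judgment to document, not something the fence decides.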

In [None]:
#TODO

## 5. Bivariate Analysis
- Compare the target column across categorical columns.  

- Compare the target column across numerical features:  
  - Ex: check whether the values of one column grow proportionally or disproportionately in relation to another; compare average column values for different values of the target column.
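
Both comparisons reduce to a group-by followed by an average. The sketch below uses a single hypothetical helper, `group_mean`, on invented churn-style rows; the column names (`contract`, `monthly_charge`, `churned`) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; churned is a 0/1 target.
rows = [
    {"contract": "monthly", "monthly_charge": 80.0, "churned": 1},
    {"contract": "monthly", "monthly_charge": 70.0, "churned": 1},
    {"contract": "monthly", "monthly_charge": 40.0, "churned": 0},
    {"contract": "yearly", "monthly_charge": 50.0, "churned": 0},
    {"contract": "yearly", "monthly_charge": 30.0, "churned": 0},
]


def group_mean(rows, group_col, value_col):
    """Average of value_col within each distinct value of group_col."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_col]].append(r[value_col])
    return {key: mean(vals) for key, vals in groups.items()}


# Categorical feature vs. target: mean of a 0/1 target is the target rate per category.
print(group_mean(rows, "contract", "churned"))

# Numerical feature vs. target: average feature value for each target class.
print(group_mean(rows, "churned", "monthly_charge"))
```

A feature whose group averages barely differ across target values is unlikely to be informative on its own, while a feature that separates the classes perfectly deserves a leakage check.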

In [None]:
#TODO

## 7. Reproducibility Practices
- Set a **random seed** when sampling rows for inspection.  
- Save an **EDA profile report**.  
- Export a **feature schema JSON** with column names, types, and allowed ranges/categories.  

In [None]:
#TODO

## 8. Train/Validation/Test Split Strategy
- Propose and implement a split strategy:  
  - Random stratified split by target column.  
  - Ensure reproducibility with a fixed random seed.  
  - Document why stratification is necessary here.  
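
One possible shape for such a split, written with the standard library only, is sketched below. The 70/15/15 fractions, the seed, and the function name `stratified_split` are assumptions for illustration; a library splitter with a stratify option would be an equally valid choice.

```python
import json
import random
from collections import defaultdict


def stratified_split(ids, labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Shuffle ids within each class with a fixed seed, then cut every class by the same fractions."""
    by_class = defaultdict(list)
    for row_id, label in zip(ids, labels):
        by_class[label].append(row_id)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):         # sorted for a deterministic iteration order
        members = sorted(by_class[label])  # sorted so input order does not change the result
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        splits["train"] += members[:n_train]
        splits["val"] += members[n_train:n_train + n_val]
        splits["test"] += members[n_train + n_val:]  # remainder goes to test
    return splits


# Hypothetical dataset: 100 row ids, 20% positives.
ids = list(range(100))
labels = [1 if i % 5 == 0 else 0 for i in ids]

splits = stratified_split(ids, labels)
for name, members in splits.items():
    positives = sum(labels[i] for i in members)
    print(name, len(members), positives / len(members))

# Serialized form of the split, e.g. the content of a splits.json file.
payload = json.dumps(splits)
```

Storing row ids rather than row copies keeps the artifact small and lets downstream tasks rebuild exactly the same partitions from the source table.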

In [None]:
#TODO

# 🎯 Deliverables
By the end of these exercises, you should have:
1. A **data dictionary**.  
2. Summary tables/plots of findings and key features.  
3. A **feature schema JSON** with data types and constraints.  
4. A **train/val/test split file** (e.g., `splits.json`) for reproducible downstream tasks.  

## Peer Validation
  - Reproducible data loading (query or seed).  
  - Clear schema with rationale per feature.  
  - Split method documented and leakage-safe.  
  - Artifacts present and versioned.
