# Data Pipelines: Exercise


## 1. Implement ETL Pipeline Functions

- Create Python functions to perform each step of a simple ETL pipeline for a CSV file:
    - **Extract:** Read data from a CSV file into a suitable data structure (e.g., pandas DataFrame).
    - **Transform:** Clean or modify the data as needed (e.g., remove missing values, standardize formats, filter rows).
    - **Load:** Save the processed data to a new CSV file or another destination.

Each function should handle exactly one step and be reusable across different CSV files.
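
As a starting point, the three steps could be sketched as below. This is a minimal illustration, not a reference solution: it uses the standard-library `csv` module and a list of dictionaries instead of a pandas DataFrame to stay dependency-free, and the cleaning rule (strip whitespace, drop rows with any empty field) is an assumption chosen for the example. It also assumes well-formed rows, with no more fields than the header.

```python
import csv


def extract(path):
    """Read a CSV file into a list of row dictionaries (one dict per row)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def transform(rows):
    """Strip surrounding whitespace and drop rows that have any empty field."""
    cleaned = []
    for row in rows:
        # DictReader fills missing trailing fields with None, hence `or ""`.
        stripped = {key: (value or "").strip() for key, value in row.items()}
        if all(stripped.values()):
            cleaned.append(stripped)
    return cleaned


def load(rows, path):
    """Write the rows to a new CSV file, taking the header from the first row."""
    if not rows:
        raise ValueError("no rows to write")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

The same shape carries over to pandas, where `pandas.read_csv`, `DataFrame.dropna`, and `DataFrame.to_csv` would play the roles of the three function bodies.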

```python
# Your code here
```

## 2. Modularize Your ETL Pipeline

- Refactor your ETL code to improve structure and maintainability:
    - Organize your extraction, transformation, and load logic into clearly separated functions.
    - Ensure each function only handles one specific part of the process.
    - Pass data explicitly between steps (i.e., avoid using global variables).
    - Demonstrate the use of your functions by running the pipeline from start to finish on a sample CSV file.

This makes your pipeline easier to read, test, and extend as your data or requirements change.
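
One possible shape for the refactored pipeline is sketched below. It assumes the standard-library `csv` module; the `run_pipeline` name, the `required` parameter, and the sample file contents are hypothetical choices made for illustration. Each step receives its inputs as arguments and returns its result, so no global state is involved.

```python
import csv
import tempfile
from pathlib import Path


def extract(path):
    """Read a CSV file into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def transform(rows, required):
    """Keep only rows whose required columns are all non-empty."""
    return [
        row for row in rows
        if all((row.get(col) or "").strip() for col in required)
    ]


def load(rows, path, fieldnames):
    """Write rows to a CSV file with the given header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def run_pipeline(input_path, output_path, required):
    """Run extract -> transform -> load, passing data explicitly between steps."""
    rows = extract(input_path)
    cleaned = transform(rows, required)
    # Reuse the input header so the output keeps the same columns.
    fieldnames = list(rows[0].keys()) if rows else list(required)
    load(cleaned, output_path, fieldnames)
    return len(cleaned)


if __name__ == "__main__":
    # Build a small sample CSV in a temporary directory (hypothetical data).
    workdir = Path(tempfile.mkdtemp())
    sample = workdir / "sample.csv"
    sample.write_text("id,email\n1,a@example.com\n2,\n", encoding="utf-8")
    written = run_pipeline(sample, workdir / "clean.csv", required=["email"])
    print(f"wrote {written} rows")
```

Because `run_pipeline` only wires the steps together, each step can be tested on its own with small in-memory inputs.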


```python
# Your code here
```

## 3. Add Logging

- Enhance the observability of your ETL pipeline by integrating logging into each step (extract, transform, load).
- Use Python’s built-in `logging` module to record key events, errors, and data quality checks.
- Ensure that logs capture the start and end of each step, as well as any important parameters or metrics (e.g., number of records processed).
- Logging will help with debugging, monitoring, and maintaining your pipeline in production environments.
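
One way to capture the start, end, and record count of every step without repeating the same calls is a small decorator, sketched below. The `log_step` name, the `"etl"` logger name, and the example `transform` rule are assumptions for illustration; only the `logging` and `functools` calls come from the standard library.

```python
import functools
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("etl")


def log_step(func):
    """Log the start, the end (with record count), and any failure of a step."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("start %s", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # logger.exception records the traceback at ERROR level.
            logger.exception("failed %s", func.__name__)
            raise
        if hasattr(result, "__len__"):
            logger.info("end %s (%d records)", func.__name__, len(result))
        else:
            logger.info("end %s", func.__name__)
        return result
    return wrapper


@log_step
def transform(rows):
    """Drop rows with any empty field and log how many were removed."""
    cleaned = [row for row in rows if all(row.values())]
    dropped = len(rows) - len(cleaned)
    if dropped:
        # A simple data quality check: surface dropped rows as a warning.
        logger.warning("transform dropped %d rows with empty fields", dropped)
    return cleaned
```

The same decorator can wrap the extract and load functions; a load step that returns nothing simply logs its start and end without a count.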


```python
# Your code here
```

## 4. Schedule Your ETL Pipeline with Airflow

- Build a basic Airflow DAG to orchestrate your ETL pipeline.
- Schedule the DAG to run once per day.
- Use `PythonOperator` (or an equivalent, such as the TaskFlow `@task` decorator) to call your extract, transform, and load functions.
- Ensure your DAG declares its task dependencies and includes basic documentation (for example, a description or `doc_md`).
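
A minimal DAG along these lines might look like the sketch below. It assumes Airflow 2.4 or later in the 2.x line (where `DAG` accepts `schedule` and `PythonOperator` lives in `airflow.operators.python`); import paths differ in other versions. The `dag_id`, the `/tmp/etl/...` paths, and the three `*_task` functions are hypothetical, and the tasks exchange file paths rather than passing rows between them. The DAG is only built when Airflow is installed, so the task functions stay runnable in a plain notebook.

```python
import csv
import importlib.util
import shutil
from datetime import datetime


def extract_task(source_path, raw_path):
    """Stage a copy of the source CSV so later tasks never touch the original."""
    shutil.copyfile(source_path, raw_path)


def transform_task(raw_path, clean_path):
    """Drop rows with any empty field and write the result to a staging file."""
    with open(raw_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = [r for r in reader if all((v or "").strip() for v in r.values())]
    with open(clean_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_task(clean_path, output_path):
    """Publish the cleaned file to its final location."""
    shutil.copyfile(clean_path, output_path)


def build_dag():
    """Assemble the daily DAG; Airflow is imported here, not at module level."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="csv_etl_daily",  # hypothetical name
        description="Daily CSV extract, transform, load",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        doc_md="Stages a CSV, drops incomplete rows, and publishes the result.",
    ) as dag:
        extract = PythonOperator(
            task_id="extract",
            python_callable=extract_task,
            op_kwargs={"source_path": "/tmp/etl/input.csv",
                       "raw_path": "/tmp/etl/raw.csv"},
        )
        transform = PythonOperator(
            task_id="transform",
            python_callable=transform_task,
            op_kwargs={"raw_path": "/tmp/etl/raw.csv",
                       "clean_path": "/tmp/etl/clean.csv"},
        )
        load = PythonOperator(
            task_id="load",
            python_callable=load_task,
            op_kwargs={"clean_path": "/tmp/etl/clean.csv",
                       "output_path": "/tmp/etl/output.csv"},
        )
        # Task dependencies: extract runs first, then transform, then load.
        extract >> transform >> load
    return dag


# Airflow discovers DAG objects in module globals; skip when it is not installed.
if importlib.util.find_spec("airflow") is not None:
    dag = build_dag()
```

Exchanging file paths keeps large data sets out of Airflow's XCom store, which is intended for small values.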

```python
# Your code here
```

## 5. Error Handling and Retries

- Enhance your ETL pipeline by adding robust error handling and automatic retries for each step.
- Use `try`/`except` blocks to catch and log errors that occur during extraction, transformation, or loading.
- Implement a retry mechanism (with a delay between attempts) to handle temporary failures (e.g., a file that has not arrived yet, network issues), so that a transient problem does not fail the whole pipeline.
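
A retry mechanism with a fixed delay can be sketched as a decorator, as below. The `retry` name, its parameters, and the choice to retry only on `OSError` subclasses by default are assumptions made for this example; other errors (such as a malformed file) are treated as permanent and propagate immediately.

```python
import csv
import functools
import logging
import time

logger = logging.getLogger("etl")


def retry(attempts=3, delay_seconds=2.0, exceptions=(OSError,)):
    """Retry a step on the given exceptions, sleeping between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    logger.warning(
                        "%s failed (attempt %d/%d): %s",
                        func.__name__, attempt, attempts, exc,
                    )
                    if attempt == attempts:
                        raise  # out of attempts: let the caller see the error
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


@retry(attempts=3, delay_seconds=1.0, exceptions=(FileNotFoundError,))
def extract(path):
    """Read a CSV file, retrying if the file has not arrived yet."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Limiting retries to a specific tuple of exceptions avoids repeating work that cannot succeed, such as parsing a file with the wrong columns.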


```python
# Your code here
```

---
### Challenge

- Create a configuration file (YAML or JSON) to store and manage your ETL pipeline parameters, such as input/output file paths, filtering options, and column names.
- Load these parameters in your ETL script and use them to control pipeline behavior.
- This approach makes your pipeline more flexible, reusable, and easier to maintain.
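
Loading such a file could look like the sketch below, which uses JSON because the `json` module ships with Python. The `PipelineConfig` class and its three keys (`input_path`, `output_path`, `required_columns`) are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PipelineConfig:
    """Parameters that control one pipeline run (hypothetical key names)."""
    input_path: str
    output_path: str
    required_columns: list = field(default_factory=list)


def load_config(path):
    """Read pipeline parameters from a JSON file.

    Missing required keys or unknown keys raise TypeError from the
    dataclass constructor, so typos in the file fail early.
    """
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return PipelineConfig(**raw)
```

For a YAML file, the `json.loads` call could be swapped for `yaml.safe_load` from the third-party PyYAML package, with the rest unchanged.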


```yaml
# config.yaml
# Your code here
```
