# Day 4 Exercise: Data Warehousing, Data Modeling for Fraud Analytics & DevOps/CI-CD (Practical)

## Scenario
Building on the fraud detection rules implemented in PySpark on Day 3, you now need to operationalize the data pipeline by automating its deployment. The enriched transaction data and fraud flags (`sdf_final_transactions`) are stored in a data lake in CSV format. Your task is to create a CI/CD pipeline with GitHub Actions that delivers your PySpark code to a DEV environment, and an Ansible playbook that deploys the basic infrastructure the pipeline needs. The focus is on practical implementation: protect secrets and follow best practices for a robust deployment process.

## Data to Use 📊
You will use the conceptual output from the Day 3 exercise: a PySpark DataFrame (`sdf_final_transactions`) containing enriched transaction data with fraud flags (`is_fraudulent_rule1`, `is_fraudulent_rule2`). The data is stored in a data lake in CSV format (e.g., `dbfs:/mnt/datalake/fraud_detection/sdf_final_transactions/`).

## Tasks

### Part 1: CI/CD Pipeline with GitHub Actions (Practical)
**Objective**: Create a GitHub Actions workflow to automate the build, test, and deployment of your PySpark fraud detection script to an Azure Databricks DEV environment. Ensure secrets (e.g., Databricks token, storage credentials) are securely handled.
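As a rough illustration of the secret-handling objective, the sketch below shows one way a deploy step might read credentials from environment variables (which a workflow can populate from GitHub Secrets) and build a Databricks Workspace import request without ever printing the token. The function names, the `/Shared/fraud_detection` target path, and the choice of the Workspace import endpoint are assumptions made for illustration; the exercise equally allows the Databricks CLI.

```python
import base64
import json
import os
import urllib.request


def load_settings(env=os.environ):
    """Read connection settings from the environment; fail fast if absent."""
    required = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["DATABRICKS_HOST"], env["DATABRICKS_TOKEN"]


def redact(text, secrets):
    """Replace every non-empty secret value in text with a fixed mask."""
    for secret in secrets:
        if secret:
            text = text.replace(secret, "***")
    return text


def build_import_request(host, token, source_code, workspace_path):
    """Build (but do not send) a request that uploads a Python source file."""
    payload = {
        "path": workspace_path,  # hypothetical target path in the workspace
        "format": "SOURCE",
        "language": "PYTHON",
        "overwrite": True,
        # The API expects the file content base64-encoded.
        "content": base64.b64encode(source_code.encode("utf-8")).decode("ascii"),
    }
    return urllib.request.Request(
        url=host.rstrip("/") + "/api/2.0/workspace/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A deploy script could call `load_settings()`, pass the result to `build_import_request(...)`, send it with `urllib.request.urlopen`, and route any error text through `redact(...)` before logging. GitHub Actions already masks registered secrets in logs; the explicit `redact` helper is only a second line of defense for values derived from them.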

**Tasks**:
1. **GitHub Actions Workflow**:
   - Create a `.yml` workflow file under `.github/workflows/` that triggers on pushes to feature branches and on pull requests targeting `main`.
   - Include the following stages:
     - **Linting**: Check Python code for style and errors using `flake8`.
     - **Unit Testing**: Run unit tests for your PySpark script using `pytest`.
     - **Build**: Package the Python script and dependencies (e.g., create a `requirements.txt`).
     - **Deploy to DEV**: Deploy the script to an Azure Databricks workspace in the DEV environment using the Databricks CLI or API.
   - Use GitHub Secrets to securely store sensitive information like Databricks tokens and Azure Data Lake Storage credentials.
   - Ensure the workflow outputs logs for debugging but does not expose secrets.

2. **Project Structure**:
   - Structure your Git repository with the following:
     - `/src`: Contains the main PySpark script (`fraud_detection.py`).
     - `/tests`: Contains unit tests (`test_fraud_detection.py`).
     - `/requirements.txt`: Lists dependencies (e.g., `pyspark`, `pytest`, `flake8`).
     - `.gitignore`: Excludes sensitive files (e.g., `.env`, Databricks token files, temporary files).
   - Provide the content of `.gitignore` and `requirements.txt`.

3. **Unit Testing**:
   - Write a sample unit test in `test_fraud_detection.py` to verify one fraud rule (e.g., `is_fraudulent_rule1` for high transaction value for new customers).
   - Use a small, mocked dataset to simulate `sdf_final_transactions`.
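The unit-testing task above can be sketched without a Spark session, which keeps the CI test stage fast. The example below is a minimal, Spark-free illustration: the thresholds, the column names (`amount`, `customer_age_days`), and the helper `apply_rule1` are hypothetical stand-ins, since the actual rule definition lives in the Day 3 code. In a PySpark version, the same mocked rows would be passed to `spark.createDataFrame` and the condition written as a column expression.

```python
# Hypothetical thresholds; the real values come from the Day 3 rule definition.
HIGH_VALUE_THRESHOLD = 1000.0
NEW_CUSTOMER_MAX_DAYS = 30


def is_fraudulent_rule1(amount, customer_age_days):
    """Rule 1 sketch: a high-value transaction made by a new customer."""
    if amount is None or customer_age_days is None:
        return False  # assumption: rows with missing inputs are not flagged
    return amount > HIGH_VALUE_THRESHOLD and customer_age_days < NEW_CUSTOMER_MAX_DAYS


# Small mocked stand-in for sdf_final_transactions (column names are assumed).
MOCK_TRANSACTIONS = [
    {"transaction_id": "t1", "amount": 5000.0, "customer_age_days": 3},
    {"transaction_id": "t2", "amount": 5000.0, "customer_age_days": 400},
    {"transaction_id": "t3", "amount": 20.0, "customer_age_days": 3},
    {"transaction_id": "t4", "amount": None, "customer_age_days": 3},
]


def apply_rule1(rows):
    """Return copies of the rows with an is_fraudulent_rule1 flag added."""
    return [
        dict(row, is_fraudulent_rule1=is_fraudulent_rule1(
            row["amount"], row["customer_age_days"]))
        for row in rows
    ]


def test_rule1_flags_only_high_value_new_customers():
    flagged = {
        row["transaction_id"]
        for row in apply_rule1(MOCK_TRANSACTIONS)
        if row["is_fraudulent_rule1"]
    }
    assert flagged == {"t1"}


def test_rule1_threshold_is_exclusive():
    # An amount exactly at the threshold is not "high value" in this sketch.
    assert not is_fraudulent_rule1(HIGH_VALUE_THRESHOLD, 0)
```

Because the test functions follow the `test_*` naming convention, `pytest` would discover them automatically when the file is saved as `tests/test_fraud_detection.py`. The mocked rows deliberately cover the positive case, each single-condition miss, and a null amount.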

### Part 2: Ansible Playbook for Infrastructure Deployment (Practical)
**Objective**: Create an Ansible playbook to deploy a basic infrastructure to support your PySpark fraud detection pipeline in the DEV environment. The infrastructure includes an Azure Databricks workspace and connectivity to an Azure Data Lake Storage account.

**Tasks**:
1. **Ansible Playbook**:
   - Create an Ansible playbook (`deploy_databricks_infra.yml`) to:
     - Provision an Azure Databricks workspace in the DEV environment (or configure an existing one).
     - Configure a mount point to the Azure Data Lake Storage account where the CSV data resides.
     - Set up a Databricks cluster with basic configurations (e.g., single node, Spark version compatible with your PySpark script).
   - Use Ansible Vault to encrypt sensitive variables (e.g., Azure service principal credentials, Databricks workspace tokens).
   - Ensure the playbook is idempotent (it can be run multiple times without errors and without changing resources that are already in the desired state).

2. **Inventory and Variables**:
   - Provide an Ansible inventory file (`inventory.yml`) defining the target environment (e.g., localhost for Azure CLI execution).
   - Provide a variable file (`vars.yml`) for non-sensitive configurations (e.g., Databricks workspace name, cluster settings).
   - Describe how to use Ansible Vault for sensitive variables (e.g., `ansible-vault encrypt_string` for Azure credentials).

3. **Assumptions**:
   - Assume the Azure CLI is installed on the machine running Ansible.
   - Assume the Azure Data Lake Storage account already exists, and you are mounting it to Databricks.
   - Assume the CSV data is accessible at `dbfs:/mnt/datalake/fraud_detection/`.
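The idempotency requirement is easiest to get wrong at the mount step, because mounting onto a mount point that is already in use typically fails on a second run. The sketch below shows a check-then-create pattern that a notebook or job triggered by the playbook might use. `ensure_mount` and `FakeFs` are hypothetical helpers written for this illustration, and the `abfss://` source in the usage note uses placeholder container and account names; only `dbutils.fs.mounts()` and `dbutils.fs.mount(...)` are Databricks utilities.

```python
from types import SimpleNamespace


def ensure_mount(dbutils, source, mount_point, extra_configs):
    """Mount source at mount_point unless that mount already exists.

    Returns True if a mount was created, False if it was already present.
    Raises ValueError if the mount point is taken by a different source.
    """
    existing = {m.mountPoint: m.source for m in dbutils.fs.mounts()}
    if mount_point in existing:
        if existing[mount_point] != source:
            raise ValueError(
                f"{mount_point} is already mounted from {existing[mount_point]}"
            )
        return False  # desired state already reached; nothing to do
    dbutils.fs.mount(
        source=source, mount_point=mount_point, extra_configs=extra_configs
    )
    return True


class FakeFs:
    """Test double that mimics the two dbutils.fs calls used above."""

    def __init__(self):
        self._mounts = []
        self.mount_calls = 0

    def mounts(self):
        return list(self._mounts)

    def mount(self, source, mount_point, extra_configs):
        self.mount_calls += 1
        self._mounts.append(SimpleNamespace(mountPoint=mount_point, source=source))
        return True
```

Inside a Databricks notebook, a call might look like `ensure_mount(dbutils, "abfss://<container>@<account>.dfs.core.windows.net/", "/mnt/datalake", configs)`, where `configs` holds the service principal settings. Running it twice performs only one mount, which is the same behavior the playbook tasks should aim for when they create the workspace and cluster.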

### Deliverables
1. **GitHub Actions Workflow**:
   - A `.yml` file for the CI/CD pipeline.
   - A `.gitignore` file.
   - A `requirements.txt` file.
   - A sample unit test file (`test_fraud_detection.py`).
2. **Ansible Playbook**:
   - An Ansible playbook (`deploy_databricks_infra.yml`).
   - An inventory file (`inventory.yml`).
   - A variable file (`vars.yml`).
   - Instructions for encrypting sensitive variables using Ansible Vault.
3. **Markdown Explanation**:
   - Explain your CI/CD pipeline design, including how secrets are protected.
   - Describe the Ansible playbook structure and how it ensures idempotency.
   - Discuss [...]

### Evaluation Criteria
1. **Code Quality**: Readability, modularity, proper error handling, and adherence to PEP 8 standards.
2. **Robustness**: The pipeline and playbook handle edge cases like missing dependencies, failed API calls, or existing resources.
3. **Efficiency**: The pipeline minimizes runtime, and the playbook optimizes resource provisioning.
4. **Reasoning**: Clear and logical explanations in the Markdown file, covering all tasks and edge cases.