### Day 4 Exercise: Data Warehousing, Data Modeling & DevOps/CI/CD 🏗️

#### Objective
This exercise focuses on the principles of data warehousing and analytical data modeling. You will design a Star Schema for our fraud analytics use case and conceptualize how to build and deploy the data pipeline using DevOps and CI/CD practices.

#### Scenario
The fraud detection rules are now implemented in PySpark. The next step is to store the enriched transaction data and fraud flags in a structured, optimized way for business intelligence (BI) and analytics. You need to design a data model that allows analysts to easily query and build dashboards on fraudulent activities. You also need to think about how to automate the deployment of your data pipeline.

#### Data to Use 📊
You will use the conceptual output from the Day 3 exercise: a PySpark DataFrame (`sdf_final_transactions`) containing enriched transaction data along with the boolean fraud flags (`is_fraudulent_rule1`, `is_fraudulent_rule2`).

---

### Part 1: Data Modeling for Fraud Analytics (Conceptual) ⭐

#### 1.1 Design a Star Schema
Based on the available data, design a **Star Schema** to model the data for analytical querying. Define the tables, columns, data types, and relationships.

* **Fact Table**: `fact_transactions`
    * What are the measures (the quantitative values) in this table?
    * What are the foreign keys that will link to the dimension tables?
    * Define all columns, their data types (e.g., INTEGER, DECIMAL, VARCHAR, DATETIME, BOOLEAN), and their purpose.

* **Dimension Tables**:
    * **`dim_customer`**:
        * How would you populate this table?
        * Define its columns, including a surrogate key.
    * **`dim_date`**:
        * Why is a separate date dimension useful?
        * What are some useful attributes this table would contain (e.g., day, week, month, quarter, year, day_of_week)?
        * Define its columns.
    * **`dim_location`** (Optional, but recommended):
        * How could you create a location dimension from the `ip_address`? (Conceptually - e.g., using a Geo-IP lookup service).
        * What columns would it have?

#### 1.2 Slowly Changing Dimensions (SCD)
* In Day 2, we discussed SCD Type 2 for `customer_tier`. Now, provide a more detailed explanation.
* **Task**: Illustrate how you would update the `dim_customer` table if "Alice Smith" (customer_id 'C101') was upgraded from "Gold" to "Platinum" tier on `2024-07-18`.
* Show the state of the `dim_customer` table **before** and **after** the change, including the necessary columns to handle SCD Type 2 (e.g., `customer_sk`, `customer_id`, `tier`, `start_date`, `end_date`, `is_current`).

---

### Part 2: DevOps & CI/CD for Data Pipelines (Conceptual) 🤖

#### 2.1 Version Control with Git
* Describe how you would structure your project in a Git repository. What would be the key folders and files? (e.g., `/src` for code, `/tests` for unit tests, `/notebooks` for exploration, `requirements.txt`, etc.).
* Explain the purpose of a `.gitignore` file in this project. What are some specific files or directories you would add to it?

#### 2.2 Continuous Integration (CI)
* What is the primary goal of Continuous Integration for this data engineering project?
* Describe a simple CI pipeline for your PySpark fraud detection script. What are the key stages or steps?
    * **Hint**: Think about what should happen automatically when a developer pushes a change to a feature branch. (e.g., Trigger, Linting, Unit Testing, Packaging).

#### 2.3 Continuous Deployment/Delivery (CD)
* What is the difference between Continuous Delivery and Continuous Deployment in the context of this project?
* Describe a conceptual CD pipeline that takes the tested PySpark application from the CI pipeline and deploys it to your Azure Databricks environment. What are the key stages?
    * **Hint**: Think about environments (Dev, Staging, Prod), approvals, and the actual deployment steps.

### Solution

In [None]:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("FraudDataLoading").getOrCreate()

# Define the file path
file_path = "/content/drive/MyDrive/Colab Notebooks/fraud_detection_exercise/sdf_final_transaction.csv/part-00000-b3362811-915c-4425-8025-3a497ac441f9-c000.csv"

# Load the dataset into a DataFrame
# We assume the CSV has a header and infer the schema
sdf_final_transactions = spark.read.csv(file_path, header=True, inferSchema=True)


In [None]:

# Print the schema
print("DataFrame Schema:")
sdf_final_transactions.printSchema()

# Show some records
print("\nDataFrame Records:")
sdf_final_transactions.show()

DataFrame Schema:
root
 |-- transaction_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- currency: string (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- transaction_hour: integer (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- customer_email: string (nullable = true)
 |-- registration_date: date (nullable = true)
 |-- customer_tier: string (nullable = true)
 |-- last_login_date: string (nullable = true)
 |-- is_fraudulent_rule1: boolean (nullable = true)
 |-- timestamp_to_date: date (nullable = true)
 |-- is_fraudulent_rule1_spark: boolean (nullable = true)
 |-- prev_ip_address: string (nullable = true)
 |-- transaction_count_10min: integer (nullable = true)
 |-- is_fraudulent_rule2: boolean (nullable = true)
 |-- is_fraudulent_combined: boolean (nullable = true)


DataFrame Records:
+--------------+-----------+------+-------------------+--

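**`fact_transactions`** is the fact table; it answers the business requirement of recording every transaction. Its measure is `amount`, and its foreign keys point to the dimension tables. A possible set of columns, sketched as generic SQL DDL (types may need adjusting for the target warehouse):

```
CREATE TABLE fact_transactions (
    transaction_id         VARCHAR(64) PRIMARY KEY,  -- natural key from the source system
    customer_sk            BIGINT,                   -- foreign key to dim_customer
    date_key               INTEGER,                  -- foreign key to dim_date
    location_sk            BIGINT,                   -- foreign key to dim_location
    transaction_timestamp  TIMESTAMP,                -- exact time of the transaction
    amount                 DECIMAL(18, 2),           -- measure: transaction amount
    is_fraudulent_rule1    BOOLEAN,
    is_fraudulent_rule2    BOOLEAN,
    is_fraudulent_combined BOOLEAN                   -- combined fraud flag
);
```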

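This table stores the facts about each transaction. We then create three dimension tables that describe the customer, the date of the transaction, and the location of the transaction. Sketched as generic SQL DDL:

```
CREATE TABLE dim_customer (
    customer_sk        BIGINT PRIMARY KEY,  -- surrogate key (a UUID would also work)
    customer_id        VARCHAR(64),         -- natural key from the source, e.g. 'C101'
    full_name          VARCHAR(255),
    email              VARCHAR(255),
    address            VARCHAR(255),
    mobile             VARCHAR(32),
    registration_date  DATE,
    last_login         TIMESTAMP,
    tier               VARCHAR(32),
    start_date         DATE,                -- SCD Type 2: first day this row is valid
    end_date           DATE,                -- SCD Type 2: NULL while the row is current
    is_current         BOOLEAN              -- SCD Type 2: TRUE for the latest version
);
```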
I would then populate this table with an `INSERT INTO ... SELECT` from the data source, inferring the schema from the source so that, even if the incoming schema does not match this one, all the incoming records are still collected.
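As a minimal sketch of how such a load could handle the tier change from section 1.2, the plain-Python function below closes the current row for a customer and appends a new version. It assumes `dim_customer` carries the SCD Type 2 columns `start_date`, `end_date`, and `is_current`. The function name `apply_scd2_tier_change`, the in-memory list of dicts, and the "Gold" start date of 2023-01-15 are illustrative assumptions, not part of the exercise data; in a real pipeline the same logic would more likely be a `MERGE` statement or a PySpark job.

```python
from datetime import date


def apply_scd2_tier_change(dim_rows, customer_id, new_tier, change_date):
    """Return a new list of dim_customer rows with an SCD Type 2 tier change applied."""
    updated = []
    current = None
    for row in dim_rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        if row["customer_id"] == customer_id and row["is_current"]:
            current = row
        updated.append(row)

    # Nothing to do if the customer is unknown or the tier did not change.
    if current is None or current["tier"] == new_tier:
        return updated

    # Expire the current version (here end_date equals the change date;
    # some teams use the day before instead).
    current["end_date"] = change_date
    current["is_current"] = False

    # Append the new version under the next surrogate key.
    new_row = dict(current)
    new_row.update(
        customer_sk=max(r["customer_sk"] for r in updated) + 1,
        tier=new_tier,
        start_date=change_date,
        end_date=None,
        is_current=True,
    )
    updated.append(new_row)
    return updated


# "Before" state; the start_date is a made-up value for illustration.
before = [
    {"customer_sk": 1, "customer_id": "C101", "full_name": "Alice Smith",
     "tier": "Gold", "start_date": date(2023, 1, 15), "end_date": None,
     "is_current": True},
]
after = apply_scd2_tier_change(before, "C101", "Platinum", date(2024, 7, 18))
for row in after:
    print(row)
```

After the call, the "Gold" row is kept as history with `is_current` set to false, and the "Platinum" row becomes the current version with a new `customer_sk`, so older facts still join to the tier that was valid when they happened.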

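```
CREATE TABLE dim_location (
    location_sk  BIGINT PRIMARY KEY,  -- surrogate key
    ip_address   VARCHAR(45),         -- natural key; 45 characters also fits IPv6
    loc_x        DECIMAL(9, 6),       -- x coordinate (e.g. longitude) from a Geo-IP lookup
    loc_y        DECIMAL(9, 6)        -- y coordinate (e.g. latitude) from a Geo-IP lookup
);
```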
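One way to fill a location dimension is to take the distinct IP addresses from the transactions and enrich each one once. The sketch below assumes a lookup function; `lookup_geo` and its hard-coded table are hypothetical stand-ins for a real Geo-IP service, the coordinates are made up, and the addresses come from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical stand-in for a Geo-IP service; the coordinates are made up.
_FAKE_GEO_DB = {
    "192.0.2.10": (12.5, 41.9),
    "198.51.100.7": (-74.0, 40.7),
}


def lookup_geo(ip):
    """Return (loc_x, loc_y) for an IP address, or (None, None) if unknown."""
    return _FAKE_GEO_DB.get(ip, (None, None))


def build_dim_location(ip_addresses):
    """Build one dim_location row per distinct, well-formed IP address."""
    rows = []
    seen = set()
    for raw in ip_addresses:
        try:
            ip = str(ipaddress.ip_address(raw.strip()))
        except ValueError:
            continue  # skip malformed addresses
        if ip in seen:
            continue  # one row per address
        seen.add(ip)
        loc_x, loc_y = lookup_geo(ip)
        rows.append({
            "location_sk": len(rows) + 1,  # simple sequential surrogate key
            "ip_address": ip,
            "loc_x": loc_x,
            "loc_y": loc_y,
        })
    return rows


dim_location = build_dim_location(
    ["192.0.2.10", "198.51.100.7", "192.0.2.10", "not-an-ip"]
)
for row in dim_location:
    print(row)
```

Deduplicating before the lookup keeps the number of calls to the external service proportional to the number of distinct addresses rather than the number of transactions.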

```
CREATE TABLE dim_date (
    date_key     INTEGER PRIMARY KEY,  -- e.g. 20240718
    full_date    DATE,
    day          SMALLINT,
    week         SMALLINT,
    month        SMALLINT,
    quarter      SMALLINT,
    year         SMALLINT,
    day_of_week  VARCHAR(9)            -- e.g. 'Thursday'
);
```

Even though `dim_date` is not strictly required, it is useful: when we filter or group the data by timestamp or date, we can simply join to this table instead of running complicated date operations in every query, because derived attributes such as the day of the week are already unpacked in the table.
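A date dimension is usually generated once for a fixed calendar range rather than derived from the transactions. The sketch below shows one way to do that with the standard library; the function name `build_dim_date`, the `date_key` format (YYYYMMDD as an integer), and the July 2024 range are assumptions for illustration.

```python
from datetime import date, timedelta


def build_dim_date(start, end):
    """Return one dim_date row per calendar day in [start, end]."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),
            "full_date": current,
            "day": current.day,
            "week": current.isocalendar()[1],  # ISO week number
            "month": current.month,
            "quarter": (current.month - 1) // 3 + 1,
            "year": current.year,
            "day_of_week": current.strftime("%A"),
        })
        current += timedelta(days=1)
    return rows


dim_date = build_dim_date(date(2024, 7, 1), date(2024, 7, 31))
row = next(r for r in dim_date if r["full_date"] == date(2024, 7, 18))
print(row)
```

With the rows precomputed, a query such as "fraudulent transactions per weekday" becomes a join and a `GROUP BY day_of_week` rather than a date function applied to every fact row.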

**Question**: I did not add constraints to these tables (such as NOT NULL on `transaction_id` or on the other key columns) because we infer the schema from the source and apply it on read rather than enforcing it on write. Please explain again whether this is correct: as I remember it, schema-on-read means that a schema, such as the one I wrote above, is applied when the data is read. Am I right? Please put the answer at the top of your reply.
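To make the schema-on-read idea concrete, here is a small sketch using only the Python standard library: the raw data is stored as untyped text, and a declared schema is applied only when the rows are read. The `SCHEMA` mapping, the `read_with_schema` helper, and the sample rows are illustrative assumptions; in PySpark the closest equivalent would be passing an explicit schema to the CSV reader instead of `inferSchema=True`.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Declared schema: column name -> function that converts the raw string.
SCHEMA = {
    "transaction_id": str,
    "customer_id": str,
    "amount": Decimal,
    "timestamp": datetime.fromisoformat,
    "is_fraudulent_combined": lambda s: s.strip().lower() == "true",
}


def read_with_schema(text, schema):
    """Apply `schema` while reading; return (typed_rows, rejected_rows)."""
    good, bad = [], []
    for raw in csv.DictReader(io.StringIO(text)):
        try:
            good.append({col: cast(raw[col]) for col, cast in schema.items()})
        except (KeyError, TypeError, ValueError, ArithmeticError):
            bad.append(raw)  # keep rows that do not fit the schema for inspection
    return good, bad


# The stored text accepts anything; problems only surface at read time.
RAW = """transaction_id,customer_id,amount,timestamp,is_fraudulent_combined
T1,C101,120.50,2024-07-18 10:15:00,false
T2,C101,not-a-number,2024-07-18 10:16:00,true
"""
rows, rejected = read_with_schema(RAW, SCHEMA)
print(rows)
print(rejected)
```

Nothing is rejected when the file is written; the mismatch in the second row is only detected when the schema is applied. That is the trade-off compared with schema-on-write, where constraints such as NOT NULL or column types reject bad rows at load time.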