# Spark Batch Processing Project Guide
---

Welcome to the **Spark Batch Processing Project**! It demonstrates batch data processing with **Apache Spark** by taking a transactional dataset through ingestion, cleaning, transformation, analysis, and visualization.

## Table of Contents
1. [Project Overview](#Project-Overview)
2. [Dataset Information](#Dataset-Information)
3. [Installation Guide](#Installation-Guide)
4. [Running the Project](#Running-the-Project)
5. [Outputs and Results](#Outputs-and-Results)
6. [Optional Scripts](#Optional-Scripts)

## 1. Project Overview
---

The project is structured into the following components:

- **Data Ingestion:** Reads the raw CSV data and writes it as Parquet, a columnar format that Spark can read more efficiently.
- **Data Cleaning and Transformation:** Handles missing values, drops irrelevant columns, and adds derived columns.
- **Data Analysis:** Aggregates transactions by merchant category and summarizes high-value transactions by country.
- **Visualization:** Plots the analysis results and saves them as PNG charts.

### Project Folder Structure:
```
├── data                  # Raw datasets (.csv files)
├── notebooks             # Jupyter Notebooks (this guide)
├── outputs               # Processed data and visualizations
├── scripts               # Python scripts for each project step
│   ├── ingest_data.py
│   ├── clean_transform.py
│   ├── analyze_data.py
│   ├── visualize_outputs.py
│   ├── sample_dataset.py
│   ├── test_dataset.py
│   └── verify_cleaned_data.py
├── requirements.txt      # Python dependencies
└── README.md             # Project documentation
```

## 2. Dataset Information
---

This project uses a **synthetic transactional dataset** designed to simulate real-world financial transactions. The dataset contains information on purchases made by customers across various merchant categories. It is structured to help identify spending patterns and potential fraudulent activity.

### **Dataset Location:**
- Located in the `data/` folder:
  - `synthetic_fraud_data.csv`
  - `sample_fraud_data.csv`

### **Dataset Features:**
- `transaction_id`: Unique identifier for each transaction.
- `customer_id`: Unique identifier for each customer.
- `card_number`: Encrypted credit/debit card number.
- `timestamp`: Date and time of the transaction.
- `merchant_category`: Category of the merchant (e.g., Grocery, Travel).
- `merchant_type`: Type of merchant (e.g., physical store, online).
- `amount`: Transaction amount.
- `currency`: Currency type (e.g., USD, GBP).
- `country`: Country where the transaction occurred.
- `city`: City of the merchant.
- `card_present`: Boolean indicating if the card was physically present.
- `high_risk_merchant`: Boolean indicating if the merchant is high-risk.
- `distance_from_home`: Distance between the customer’s home and the merchant.
- `is_fraud`: Boolean flag indicating if the transaction is fraudulent.
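
When the CSV is loaded into Spark, an explicit schema avoids the extra pass over the file that type inference needs. The sketch below builds a DDL schema string for the columns listed above. Only the Boolean columns are stated in the feature list; every other type (for example `DOUBLE` for `amount`) is an assumption to check against the actual file, and `read_transactions` is a hypothetical helper name.

```python
# Column types for the transaction dataset.
# Only the BOOLEAN columns are stated in the feature list; all other
# types are assumptions and may need adjusting for the real CSV.
COLUMN_TYPES = {
    "transaction_id": "STRING",
    "customer_id": "STRING",
    "card_number": "STRING",
    "timestamp": "TIMESTAMP",
    "merchant_category": "STRING",
    "merchant_type": "STRING",
    "amount": "DOUBLE",
    "currency": "STRING",
    "country": "STRING",
    "city": "STRING",
    "card_present": "BOOLEAN",
    "high_risk_merchant": "BOOLEAN",
    "distance_from_home": "DOUBLE",
    "is_fraud": "BOOLEAN",
}

# Spark accepts a DDL-formatted string wherever a schema is expected.
# Backticks quote each name in case it collides with a SQL keyword.
SCHEMA_DDL = ", ".join(f"`{name}` {dtype}" for name, dtype in COLUMN_TYPES.items())


def read_transactions(spark, path="data/synthetic_fraud_data.csv"):
    """Read the raw CSV with the explicit schema instead of inferSchema."""
    return spark.read.csv(path, header=True, schema=SCHEMA_DDL)
```

If a value in the file does not match its declared type, Spark's default permissive mode reads it as null, so a quick null count after loading is a useful sanity check.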

### **Purpose:**
- To simulate **fraud detection** by analyzing transactional patterns.
- To demonstrate how **batch processing** can be used to clean, transform, and analyze large datasets.

## 3. Installation Guide
---

### 1. Install Docker
- Download and install **Docker Desktop** from [Docker's Official Website](https://www.docker.com/products/docker-desktop/).
- After installation, start Docker and ensure it is running.
- Verify installation:
```bash
docker --version
```

### 2. Install Python and Dependencies
- Install **Python 3.8+** from [Python's Official Website](https://www.python.org/downloads/).
- From the project folder, install the Python packages listed in `requirements.txt`:
```bash
pip install -r requirements.txt
```

### 3. Pull the Apache Spark Docker Image
- Open your terminal and run:
```bash
docker pull bitnami/spark
```

### 4. Run the Docker Container
```bash
docker run -it -v /path/to/your/project:/app bitnami/spark /bin/bash
```
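*Replace `/path/to/your/project` with your actual project folder path.*
The `-v` option mounts that folder at `/app` inside the container, and `-it` with `/bin/bash` starts an interactive shell in it.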

## 4. Running the Project
---

Run the steps below in order from the project root (`/app` if you are working inside the container). Each step builds on the output of the one before it.

### Step 1: Data Ingestion
```bash
python scripts/ingest_data.py
```
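
The contents of `ingest_data.py` are not shown in this guide. As a minimal sketch of what a CSV-to-Parquet ingestion step can look like in PySpark, assuming the input and output paths listed elsewhere in this guide and using hypothetical function names:

```python
def build_session(app_name="SparkBatchProcessing"):
    """Create or reuse a SparkSession (the app name is an arbitrary choice)."""
    # Imported here so ingest() can be imported without Spark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def ingest(spark, csv_path="data/synthetic_fraud_data.csv",
           parquet_path="outputs/raw_dataset.parquet"):
    """Read the raw CSV and write it back out as Parquet."""
    # header=True uses the first row as column names; inferSchema=True asks
    # Spark to guess column types, at the cost of an extra pass over the file.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # mode("overwrite") lets the step be re-run without failing on existing output.
    df.write.mode("overwrite").parquet(parquet_path)
    return df
```

Calling `ingest(build_session())` from the project root would run the step end to end. Note that Spark writes Parquet output as a directory of part files rather than a single file.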

### Step 2: Data Cleaning and Transformation
```bash
python scripts/clean_transform.py
```
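
The exact rules in `clean_transform.py` are not listed in this guide. The sketch below illustrates the three kinds of operation it is described as performing (handling missing values, dropping columns, adding derived columns) on a Spark DataFrame. Which columns are dropped or filled, the derived column names, and the `HIGH_VALUE_THRESHOLD` cutoff are all assumptions chosen for illustration.

```python
HIGH_VALUE_THRESHOLD = 1000.0  # assumed cutoff; this guide does not define "high-value"


def clean_transform(df):
    """Apply illustrative cleaning and transformation steps to a Spark DataFrame."""
    # 1. Missing values: drop rows lacking the fields the analysis depends on,
    #    and fill a placeholder where a missing value is tolerable.
    df = df.dropna(subset=["transaction_id", "amount", "timestamp"])
    df = df.fillna({"city": "Unknown"})

    # 2. Irrelevant columns: the card number is not needed for aggregation.
    df = df.drop("card_number")

    # 3. Derived columns, written as Spark SQL expressions.
    df = df.selectExpr(
        "*",
        "to_date(`timestamp`) AS transaction_date",
        "hour(`timestamp`) AS transaction_hour",
        f"amount > {HIGH_VALUE_THRESHOLD} AS is_high_value",
    )
    return df
```

A typical use would be to read `outputs/raw_dataset.parquet`, pass the DataFrame through `clean_transform`, and write the result to `outputs/cleaned_dataset.parquet`.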

### Step 3: Data Analysis
```bash
python scripts/analyze_data.py
```
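
One way to produce the two summaries named in the outputs section is with Spark SQL over the cleaned data, as sketched below. The aggregate column names, the `HIGH_VALUE_THRESHOLD` cutoff, and the `analyze` function are assumptions, not the project's actual code. Writing through `toPandas()` also assumes pandas is installed.

```python
HIGH_VALUE_THRESHOLD = 1000.0  # assumed cutoff; this guide does not define "high-value"

MERCHANT_SUMMARY_SQL = """
SELECT merchant_category,
       COUNT(*)    AS transaction_count,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount
FROM transactions
GROUP BY merchant_category
ORDER BY total_amount DESC
"""

HIGH_VALUE_SUMMARY_SQL = f"""
SELECT country,
       COUNT(*)    AS high_value_count,
       SUM(amount) AS high_value_amount
FROM transactions
WHERE amount > {HIGH_VALUE_THRESHOLD}
GROUP BY country
ORDER BY high_value_amount DESC
"""


def analyze(spark, parquet_path="outputs/cleaned_dataset.parquet", out_dir="outputs"):
    """Aggregate the cleaned data and save both summaries as CSV files."""
    # Register the cleaned data under a name the SQL queries can refer to.
    spark.read.parquet(parquet_path).createOrReplaceTempView("transactions")

    # The summaries are small (one row per category or country), so collecting
    # them to the driver with toPandas() and writing a single CSV is reasonable.
    spark.sql(MERCHANT_SUMMARY_SQL).toPandas().to_csv(
        f"{out_dir}/merchant_summary.csv", index=False)
    spark.sql(HIGH_VALUE_SUMMARY_SQL).toPandas().to_csv(
        f"{out_dir}/high_value_summary.csv", index=False)
```

Collecting to pandas is used here because Spark's own CSV writer produces a directory of part files, whereas the outputs listed in this guide are single `.csv` files.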

### Step 4: Data Visualization
```bash
python scripts/visualize_outputs.py
```
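
The plotting code itself is not shown in this guide. As a rough sketch, a summary CSV can be turned into a bar chart with the standard `csv` module and matplotlib. The column names passed to `load_totals` (such as `merchant_category` and `total_amount`) are assumptions about the summary file's header, and both function names are hypothetical.

```python
import csv


def load_totals(csv_path, label_col, value_col):
    """Return (labels, values) read from a summary CSV with a header row."""
    labels, values = [], []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            labels.append(row[label_col])
            values.append(float(row[value_col]))
    return labels, values


def plot_bar(labels, values, title, ylabel, png_path):
    """Save a bar chart of values by label as a PNG file."""
    # Imported here so load_totals can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render straight to a file; no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_ylabel(ylabel)
    ax.tick_params(axis="x", labelrotation=45)  # keep long category names readable
    fig.tight_layout()
    fig.savefig(png_path)
    plt.close(fig)
```

For example, `plot_bar(*load_totals("outputs/merchant_summary.csv", "merchant_category", "total_amount"), "Total Amount by Category", "Total amount", "outputs/total_amount_by_category.png")` would write the first chart, provided the CSV has those columns.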


## 5. Outputs and Results
---

After all four steps have run, check the `outputs/` folder for the following:

- **Processed Data:**
  - `raw_dataset.parquet` → Ingested data
  - `cleaned_dataset.parquet` → Cleaned and transformed data
- **Analysis Results:**
  - `merchant_summary.csv` → Transaction summary by merchant category
  - `high_value_summary.csv` → High-value transactions summary by country
- **Visualizations:**
  - `total_amount_by_category.png`
  - `high_value_amount_by_country.png`
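
A quick way to confirm that every step produced its output is to check for the files listed above. The helper below is a small sketch using only the standard library; `missing_outputs` is a hypothetical name and is not one of the project scripts.

```python
from pathlib import Path

# Output names taken from the list above.
EXPECTED_OUTPUTS = [
    "raw_dataset.parquet",
    "cleaned_dataset.parquet",
    "merchant_summary.csv",
    "high_value_summary.csv",
    "total_amount_by_category.png",
    "high_value_amount_by_country.png",
]


def missing_outputs(out_dir="outputs"):
    """Return the expected output names that do not exist under out_dir."""
    base = Path(out_dir)
    # exists() is true for files and directories alike, which matters because
    # Spark writes Parquet output as a directory of part files.
    return [name for name in EXPECTED_OUTPUTS if not (base / name).exists()]
```

An empty list from `missing_outputs()` means all six outputs are present. Any names it returns point to the step that still needs to be run.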

## 6. Optional Scripts
---

The following scripts are optional but useful for testing and development:

- **`sample_dataset.py`**: Creates a smaller sample of the raw dataset for quicker testing.
- **`test_dataset.py`**: Verifies that the dataset can be properly loaded into Spark (useful for testing schema and data integrity).
- **`verify_cleaned_data.py`**: Loads and inspects the cleaned dataset to ensure that data transformations were correctly applied.

Run these scripts as needed for development or troubleshooting, in the same way as the main steps (for example, `python scripts/sample_dataset.py`).
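
To give a feel for what the sampling and verification helpers might do, here is a hedged sketch in PySpark. The sampling fraction, the seed, the null-count check, and both function names are assumptions; the actual scripts may work differently. Writing through `toPandas()` assumes pandas is installed.

```python
def sample_dataset(spark, src="data/synthetic_fraud_data.csv",
                   dst="data/sample_fraud_data.csv", fraction=0.01, seed=42):
    """Write a random subset of the raw CSV for quicker test runs."""
    # Without inferSchema every column is read as a string, so values are
    # written back exactly as they appeared in the source file.
    df = spark.read.csv(src, header=True)

    # sample() keeps each row with probability `fraction`, so the size of the
    # result is approximate; a fixed seed makes the selection repeatable.
    sample = df.sample(withReplacement=False, fraction=fraction, seed=seed)

    # The sample is assumed small enough to collect and save as a single file.
    sample.toPandas().to_csv(dst, index=False)
    return sample


def null_counts(df):
    """Return a dict mapping each column name to its number of null values."""
    exprs = [
        f"sum(CASE WHEN `{name}` IS NULL THEN 1 ELSE 0 END) AS `{name}`"
        for name in df.columns
    ]
    return df.selectExpr(*exprs).first().asDict()
```

Calling `null_counts` on the cleaned dataset and looking for unexpected non-zero entries is one simple way to check that missing values were handled.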