# 📊 Financial Performance Snapshot
## Data Cleaning & Preparation Notebook

### Scope of This Notebook

In this phase, we will:

- Load the raw transactional dataset
- Apply explicit, documented data cleaning rules based on prior exploration
- Enforce correct data types and structural consistency
- Handle missing, inconsistent, or invalid records
- Produce a clean, analysis-ready dataset for financial KPI computation

This notebook focuses on **data preparation and validation** and does not include KPI calculations or analytical insights.

---

### Expected Outputs

By the end of this notebook, we will have:

- A cleaned and standardized transactional dataset
- Clearly defined and reproducible data cleaning rules
- Flags for transactions requiring special consideration
- A validated dataset suitable for reliable financial analysis

The resulting dataset will serve as the input for **financial KPI computation** and **performance analysis** in subsequent steps.

---



## 1) Load Raw Data
**Objective**

Load the raw dataset and establish a working copy for cleaning.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

PROJECT_ROOT = Path("..")
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_CLEAN = PROJECT_ROOT / "data" / "clean"

# Create the output folder (and any missing parents) if it does not exist yet
DATA_CLEAN.mkdir(parents=True, exist_ok=True)

df_raw = pd.read_csv(DATA_RAW / "financial_transactions_raw.csv")
df = df_raw.copy()  # working copy; df_raw stays untouched for reference

df.shape

(500, 6)
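
Before any cleaning starts, it can help to confirm that the raw file has the columns the later cells rely on. The following is a minimal sketch, not part of the original notebook: `load_raw` and `EXPECTED_COLUMNS` are hypothetical names, the column list is taken from the `df.dtypes` output in section 3, and the two-row CSV is invented for illustration.

```python
import io
import pandas as pd

# Column names as they appear in the notebook's df.dtypes output
EXPECTED_COLUMNS = ["order_id", "order_date", "category",
                    "payment_method", "revenue", "cost"]

def load_raw(source) -> pd.DataFrame:
    """Read the raw CSV and fail early if an expected column is absent."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"raw file is missing columns: {missing}")
    return df

# Invented in-memory CSV standing in for the real raw file
toy_csv = io.StringIO(
    "order_id,order_date,category,payment_method,revenue,cost\n"
    "1,2024-01-05,Office,Card,100.5,60\n"
    "2,,Tech,Cash,,10\n"
)
raw = load_raw(toy_csv)
working = raw.copy()  # cleaning happens on the copy, never on `raw`
print(working.shape)
```

Failing at load time keeps a renamed or dropped column from surfacing later as a confusing `KeyError` in the middle of the cleaning cells.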

## 2) Define Cleaning Rules

Based on exploration findings, the following rules are applied:

1. Convert `order_date` to datetime format; rows with invalid or missing dates are removed, since they cannot be placed in any time-based analysis.
2. Remove transactions with missing revenue values, as revenue is required for all financial KPIs.
3. Ensure `revenue` and `cost` are numeric fields.
4. Standardize categorical fields and handle missing category values explicitly.
5. Preserve one row per unique transaction (`order_id`).

All exclusions and transformations are applied systematically and documented.
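
The five rules can also be expressed as a single function, which makes them easy to rerun and test. This is a sketch under stated assumptions rather than the notebook's actual implementation: `apply_cleaning_rules` is a hypothetical helper, the toy rows are invented, and keeping the first row of a duplicated `order_id` is an assumption (the notebook only verifies uniqueness).

```python
import pandas as pd

def apply_cleaning_rules(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply rules 1-5 to a copy of the raw frame (illustrative sketch)."""
    df = raw.copy()
    # Rules 1 and 3: coerce types; unparseable values become NaT / NaN
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
    df["cost"] = pd.to_numeric(df["cost"], errors="coerce")
    # Rules 1 and 2: drop rows without a usable revenue or date
    df = df.dropna(subset=["revenue", "order_date"]).copy()
    # Rule 4: make missing categories explicit
    df["category"] = df["category"].fillna("Unknown")
    # Rule 5: one row per order_id (keeping the first is an assumption)
    df = df.drop_duplicates(subset="order_id", keep="first")
    return df.reset_index(drop=True)

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 3, 4],
    "order_date": ["2024-01-05", "not a date", "2024-02-10",
                   "2024-02-10", "2024-03-01"],
    "category": ["Office", "Tech", None, None, "Tech"],
    "payment_method": ["Card", "Cash", "Card", "Card", "Cash"],
    "revenue": ["100.5", "20", "30", "30", None],
    "cost": ["60", "10", "abc", "abc", "5"],
})
cleaned_toy = apply_cleaning_rules(toy)
print(cleaned_toy)
```

In the toy data, order 2 is dropped for its unparseable date, order 4 for its missing revenue, and the duplicated order 3 collapses to one row. Its non-numeric `cost` becomes `NaN` and is kept, because the rules above do not exclude rows on missing cost.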

## 3) Apply Cleaning Rules
**Objective**

Systematically apply the defined data cleaning rules to correct data types, remove invalid records, and standardize fields while preserving transparency and reproducibility.

In [2]:
# Convert Data Types
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
df['cost'] = pd.to_numeric(df['cost'], errors='coerce')

# Check Data Types
df.dtypes

order_id                   int64
order_date        datetime64[ns]
category                  object
payment_method            object
revenue                  float64
cost                     float64
dtype: object
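
Because `errors='coerce'` silently turns unparseable values into `NaN`, it may be worth checking how many values were present in the raw column but lost during conversion. A small sketch on an invented series (the variable names are illustrative, not from the notebook):

```python
import pandas as pd

# Invented raw values: one valid number, one bad string, one true missing value
raw_revenue = pd.Series(["100.5", "abc", None, "20"])
converted = pd.to_numeric(raw_revenue, errors="coerce")

# Present in the raw data but not parseable as a number
newly_invalid = converted.isna() & raw_revenue.notna()
print(raw_revenue[newly_invalid])
```

The same comparison works for `order_date` with `pd.to_datetime`, and it separates genuinely missing values from values that were malformed in the source.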

In [3]:
# Remove transactions with missing revenue
initial_rows = df.shape[0]

# .copy() avoids SettingWithCopyWarning when columns are assigned later
df = df[df['revenue'].notna()].copy()

removed_rows = initial_rows - df.shape[0]
removed_rows

10

In [4]:
# Remove rows whose order_date is missing or could not be parsed (NaT after coercion)
df = df[df['order_date'].notna()].copy()
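
The revenue filter reports how many rows it removed, while the date filter does not. One way to record every exclusion consistently is a small logging helper. This is a sketch only: `drop_and_log` and `audit_log` are hypothetical names, and the toy frame is invented.

```python
import pandas as pd

def drop_and_log(df, mask, reason, log):
    """Keep rows where mask is True and record how many were removed."""
    kept = df[mask].copy()
    log.append({
        "reason": reason,
        "rows_removed": len(df) - len(kept),
        "rows_left": len(kept),
    })
    return kept

# Invented data: one row lacks revenue, another lacks a date
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", None, "2024-02-10", "2024-03-01"]),
    "revenue": [100.0, 50.0, None, 75.0],
})

audit_log = []
toy = drop_and_log(toy, toy["revenue"].notna(), "missing revenue", audit_log)
toy = drop_and_log(toy, toy["order_date"].notna(),
                   "missing or invalid order_date", audit_log)

audit = pd.DataFrame(audit_log)
print(audit)
```

The resulting `audit` frame gives one line per rule, which can be copied straight into the cleaning summary.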

In [5]:
# Handle missing categorical values
df['category'] = df['category'].fillna('Unknown')
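
If the categorical columns also carry inconsistent casing or stray whitespace (an assumption; the cell above only fills missing values), a normalization step could look like the sketch below. `standardize_text` is a hypothetical helper and the sample values are invented.

```python
import pandas as pd

def standardize_text(col: pd.Series, missing_label: str = "Unknown") -> pd.Series:
    """Trim whitespace, title-case, and label missing or empty values."""
    cleaned = col.str.strip().str.title()
    # Empty strings carry no information, so treat them as missing
    cleaned = cleaned.where(cleaned.str.len() > 0)
    return cleaned.fillna(missing_label)

# Invented examples of messy categorical values
toy = pd.DataFrame({
    "category": [" office ", "OFFICE", "tech", None, ""],
    "payment_method": ["card", "Card ", None, "CASH", "cash"],
})
for col in ["category", "payment_method"]:
    toy[col] = standardize_text(toy[col])

print(toy)
```

Title-casing is only one possible convention; what matters is that each real-world category maps to exactly one label before grouping.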

## 4) Post-Cleaning Validation
**Objective**

Confirm the dataset meets analytical requirements.

In [6]:
df.isna().sum()

order_id          0
order_date        0
category          0
payment_method    0
revenue           0
cost              0
dtype: int64

In [7]:
df.shape

(486, 6)

In [8]:
df[['revenue', 'cost']].describe()

          revenue        cost
count       486.0       486.0
mean   295.681605  182.865473
std    207.701323  136.482959
min         10.72        6.66
25%        138.79      84.545
50%       254.245     148.425
75%      410.6575    245.8075
max       1185.02      915.99


In [9]:
df['order_id'].is_unique

True
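
The individual checks in this section can be bundled into one function so they are rerun automatically whenever the cleaning logic changes. This is a sketch, assuming the checks shown above (no missing values, unique `order_id`, correct types) are the full requirement; `validate_clean` is a hypothetical name and the sample frames are invented.

```python
import pandas as pd
from pandas.api import types as ptypes

def validate_clean(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    if not df["order_id"].is_unique:
        problems.append("duplicate order_id values")
    if not ptypes.is_datetime64_any_dtype(df["order_date"]):
        problems.append("order_date is not datetime")
    for col in ("revenue", "cost"):
        if not ptypes.is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")
    return problems

# Invented frames: one passes, one has a duplicate id and a missing cost
good = pd.DataFrame({
    "order_id": [1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    "revenue": [100.5, 30.0],
    "cost": [60.0, 12.0],
})
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)
bad.loc[0, "cost"] = float("nan")

print(validate_clean(good))
print(validate_clean(bad))
```

Returning a list rather than raising on the first failure shows every problem in a single pass.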

## 5) Save Clean Dataset
**Objective**

Save the cleaned, validated dataset in a standardized format to ensure it can be reliably reused for KPI analysis and downstream analytics.

In [10]:
clean_file_path = DATA_CLEAN / "financial_transactions_clean.csv"
df.to_csv(clean_file_path, index=False)

clean_file_path

WindowsPath('../data/clean/financial_transactions_clean.csv')
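
CSV files do not store column types, so `order_date` comes back as plain text unless it is parsed again on load. The sketch below demonstrates this on an invented frame written to a temporary folder; the file name `toy_clean.csv` is illustrative only.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Invented stand-in for the cleaned dataset
toy = pd.DataFrame({
    "order_id": [1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    "category": ["Office", "Unknown"],
    "revenue": [100.5, 30.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_clean.csv"
    toy.to_csv(path, index=False)
    plain = pd.read_csv(path)                               # dates stay as text
    reloaded = pd.read_csv(path, parse_dates=["order_date"])  # dates restored

print(plain["order_date"].dtype)
print(reloaded["order_date"].dtype)
```

Downstream notebooks that read the clean file would therefore need `parse_dates=["order_date"]` (or an equivalent conversion) before any time-based grouping.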

## 6) Cleaning Summary

- 10 transactions with missing revenue and 4 with missing or invalid order dates were removed, leaving 486 of the original 500 rows.
- `revenue` and `cost` were converted to numeric types, and `order_date` to datetime.
- Missing categorical values were standardized as `Unknown`.
- The resulting dataset is transaction-level, internally consistent, and suitable for financial KPI computation.

This cleaned dataset will be used as the input for KPI calculation and performance analysis.