# Data Lake Folder Structure

## Raw Data

- **Path:** `data_lake/raw/<source>/<source>_<YYYYMMDD>.csv`
    - `source` = `transactions` or `customers` (or other sources added later)
    - `YYYYMMDD` = ingestion date

## Processed Data

- **Path:** `data_lake/processed/<source>/<source>_clean_<YYYYMMDD>.csv`
    - Follows the same per-source folder and `YYYYMMDD` date convention as raw data, with `_clean` added to the file name
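The raw and processed conventions differ only in the stage folder and the `_clean` suffix, so one helper can build both paths. The following is a minimal sketch, not part of the original notebook: the function name `build_path` and the constant `DATA_LAKE_DIR` are hypothetical names chosen for illustration.

```python
import os
from datetime import datetime

DATA_LAKE_DIR = "data_lake"  # hypothetical constant; root folder from the layout above


def build_path(stage, source, date=None):
    """Return the CSV path for a source under the 'raw' or 'processed' stage."""
    if stage not in ("raw", "processed"):
        raise ValueError(f"unknown stage: {stage!r}")
    if date is None:
        # Default to today's date in YYYYMMDD form
        date = datetime.now().strftime("%Y%m%d")
    # Processed files carry a "_clean" suffix after the source name
    suffix = "_clean" if stage == "processed" else ""
    return os.path.join(DATA_LAKE_DIR, stage, source, f"{source}{suffix}_{date}.csv")


print(build_path("raw", "transactions", "20250824"))
print(build_path("processed", "transactions", "20250824"))
```

Keeping the naming rule in one function means a later change to the convention only has to be made in one place.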

## Logs

- **File:** `data_lake/ingestion.log`
    - Records all ingestion events and errors
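One way to write to this log file is Python's standard `logging` module. The sketch below is an assumption about how the log could be set up, not the notebook's actual configuration: the helper name `get_ingestion_logger`, the logger name `"ingestion"`, and the message format are all hypothetical.

```python
import logging
import os


def get_ingestion_logger(log_path=os.path.join("data_lake", "ingestion.log")):
    """Return a logger that appends ingestion events and errors to log_path."""
    os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
    logger = logging.getLogger("ingestion")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    target = os.path.abspath(log_path)
    # Attach a file handler only once per file, so re-running a notebook cell
    # does not produce duplicate log lines
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == target
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With this in place, `get_ingestion_logger().info("Saved raw data to ...")` would record a successful save, and `.error(...)` would record a failure.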

**This structure ensures:**
- Easy lookup by data source and ingestion date
- Room to add new sources or data types without reorganizing existing folders
- Traceability from each file back to its source and ingestion date, for audits and ML reproducibility
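Because the source and date are encoded in every file name, they can be recovered from the name alone. The following is a minimal sketch of that idea, assuming the two naming patterns described above; `FILENAME_RE` and `parse_lake_filename` are hypothetical names, and a raw source whose own name ends in `_clean` would be ambiguous under this pattern.

```python
import re
from datetime import datetime

# Matches "<source>_<YYYYMMDD>.csv" and "<source>_clean_<YYYYMMDD>.csv"
FILENAME_RE = re.compile(r"^(?P<source>.+?)(?P<clean>_clean)?_(?P<date>\d{8})\.csv$")


def parse_lake_filename(filename):
    """Return source, stage flag, and date for a lake file name, or None if it does not fit."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    try:
        # Reject eight-digit strings that are not real calendar dates
        date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        return None
    return {
        "source": match.group("source"),
        "processed": match.group("clean") is not None,
        "date": date,
    }


print(parse_lake_filename("transactions_20250824.csv"))
print(parse_lake_filename("transactions_clean_20250824.csv"))
```

A check like this can also be run over a folder listing to flag files that do not follow the convention.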


In [1]:
import os
import pandas as pd
from datetime import datetime

# Set up folder structure (one raw folder per source)
data_lake_dir = 'data_lake'
raw_dirs = [
    os.path.join(data_lake_dir, 'raw', 'transactions'),
    os.path.join(data_lake_dir, 'raw', 'customers')
]
for d in raw_dirs:
    os.makedirs(d, exist_ok=True)

# Save a DataFrame to its source folder under raw/, with a YYYYMMDD date stamp in the file name
def save_raw_data(df, source, today=None):
    if today is None:
        today = datetime.now().strftime('%Y%m%d')
    dir_path = os.path.join(data_lake_dir, 'raw', source)
    os.makedirs(dir_path, exist_ok=True)
    file_path = os.path.join(dir_path, f"{source}_{today}.csv")
    df.to_csv(file_path, index=False)
    print(f"Saved raw data to {file_path}")

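# Example usage.
# Each input file must already exist: pd.read_csv raises FileNotFoundError otherwise.

# Re-save an already-ingested transactions CSV under today's date
transactions = pd.read_csv('data_lake/raw/transactions/transactions_20250824.csv')
save_raw_data(transactions, 'transactions')

# Re-save API data that was previously written to CSV
customers = pd.read_csv('data_lake/raw/customers/random_users_20250824.csv')
save_raw_data(customers, 'customers')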


FileNotFoundError: [Errno 2] No such file or directory: 'data_lake/raw/transactions/transactions_20250824.csv'
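
The `FileNotFoundError` above is raised by `pd.read_csv` when the dated file has not been ingested yet. One possible guard, sketched here under the assumption that raw files follow the `<source>_<YYYYMMDD>.csv` convention, is to check for the file before reading it. The helper name `find_raw_file` is hypothetical.

```python
import os


def find_raw_file(source, date, data_lake_dir="data_lake"):
    """Return the expected raw CSV path if the file exists, else None."""
    file_path = os.path.join(data_lake_dir, "raw", source, f"{source}_{date}.csv")
    return file_path if os.path.isfile(file_path) else None


path = find_raw_file("transactions", "20250824")
if path is None:
    print("No raw transactions file for 20250824; run ingestion first")
else:
    print(f"Found {path}")
```

When a path is returned, it can be passed to `pd.read_csv(path)` as in the cell above. Note that `random_users_20250824.csv` does not follow the `<source>_<YYYYMMDD>.csv` pattern, so this helper would not find it under `customers`.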