# Full Load vs Incremental Load Case Study

## Learning Goals
- Understand Full Load vs Incremental Load concepts
- Work with CSV and JSON files using pandas
- Merge data from different sources
- Apply basic data transformations

## Key Concepts
- **Full Load**: Reprocess the entire dataset from scratch on every run
- **Incremental Load**: Process only the records added since the last run
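
The two definitions above can be sketched in a few lines. This is a minimal illustration, not a production pattern: the `source` table, the `full_load` and `incremental_load` helpers, and the stored `watermark` date are all hypothetical names introduced for this example.

```python
import pandas as pd

# Hypothetical in-memory source table; in practice this would be a file or database.
source = pd.DataFrame({
    "transaction_id": ["TXN001", "TXN002", "TXN003", "TXN004"],
    "sale_date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
})


def full_load(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row, regardless of what earlier runs processed."""
    return df.copy()


def incremental_load(df: pd.DataFrame, last_loaded: pd.Timestamp) -> pd.DataFrame:
    """Return only rows dated after the watermark (the last date already processed)."""
    return df[df["sale_date"] > last_loaded].copy()


# Assumption: the previous run recorded the latest sale_date it processed.
watermark = pd.Timestamp("2024-01-02")

full_rows = full_load(source)
new_rows = incremental_load(source, watermark)

print(f"Full load rows: {len(full_rows)}")
print(f"Incremental load rows: {len(new_rows)}")
```

The incremental version only works if something remembers the watermark between runs, which is the main extra cost of the pattern.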

## Scenario
You work for a retail company that receives daily sales data in different file formats (here, CSV and JSON). You need to combine and process this data efficiently.

## Import Required Libraries

In [None]:
import pandas as pd

print("Libraries imported successfully!")

## TASK 1: Create Sample Data Files

First, let's create sample data files with the same schema in both CSV and JSON formats.

### Create Full Load Data (CSV) - Complete Dataset

In [None]:
%%writefile sales_full.csv
transaction_id,customer_id,product_name,quantity,price,sale_date
TXN001,CUST001,Laptop,1,999.99,2024-01-01
TXN002,CUST002,Mouse,2,25.50,2024-01-01
TXN003,CUST001,Keyboard,1,75.00,2024-01-02
TXN004,CUST003,Monitor,1,299.99,2024-01-02
TXN005,CUST002,Headphones,1,89.99,2024-01-03
TXN006,CUST003,Webcam,1,49.99,2024-01-03
TXN007,CUST001,Tablet,1,399.99,2024-01-04

### Create Incremental Load Data (JSON) - New Transactions Only

In [None]:
%%writefile sales_new.json
[
  {
    "transaction_id": "TXN008",
    "customer_id": "CUST002",
    "product_name": "Speaker",
    "quantity": 1,
    "price": 79.99,
    "sale_date": "2024-01-05"
  },
  {
    "transaction_id": "TXN009",
    "customer_id": "CUST004",
    "product_name": "Phone",
    "quantity": 1,
    "price": 699.99,
    "sale_date": "2024-01-05"
  },
  {
    "transaction_id": "TXN010",
    "customer_id": "CUST003",
    "product_name": "Cable",
    "quantity": 3,
    "price": 15.99,
    "sale_date": "2024-01-06"
  }
]

## TASK 2: Load Data with Pandas

Now let's load both files using pandas and understand the difference between full load and incremental load.

In [None]:
# FULL LOAD: Load complete dataset from CSV
print("=== FULL LOAD ===")
print("Loading complete dataset from CSV file...")

df_full = pd.read_csv('sales_full.csv')
print(f"Full load dataset shape: {df_full.shape}")
print("\nFull dataset:")
print(df_full)

print("\n" + "="*50)

# INCREMENTAL LOAD: Load only new data from JSON
print("=== INCREMENTAL LOAD ===")
print("Loading only new transactions from JSON file...")

df_new = pd.read_json('sales_new.json')
print(f"Incremental load dataset shape: {df_new.shape}")
print("\nNew transactions:")
print(df_new)
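
Before combining two sources, it can help to confirm that they really share a schema. The sketch below is one possible check, using small inline copies of the data (via `io.StringIO`) so that it runs on its own; the variable names are illustrative.

```python
import io

import pandas as pd

# Inline stand-ins for the CSV and JSON files, so this example is self-contained.
csv_text = """transaction_id,customer_id,product_name,quantity,price,sale_date
TXN001,CUST001,Laptop,1,999.99,2024-01-01
"""
json_text = """[
  {"transaction_id": "TXN008", "customer_id": "CUST002", "product_name": "Speaker",
   "quantity": 1, "price": 79.99, "sale_date": "2024-01-05"}
]"""

df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

# Compare column names independent of their order.
same_columns = sorted(df_csv.columns) == sorted(df_json.columns)
print(f"Same column names: {same_columns}")

# Show the inferred dtypes side by side to spot mismatches by eye.
print(pd.DataFrame({"csv_dtype": df_csv.dtypes, "json_dtype": df_json.dtypes}))
```

Inferred dtypes can differ between readers and pandas versions, so the side-by-side table is meant for inspection rather than as a strict test.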

## TASK 3: Merge Data from Both Sources

Let's combine the full load data with the incremental load data to create a complete dataset.

In [None]:
# Merge full load and incremental load data
print("Merging full load and incremental load data...")

# Combine both dataframes
df_combined = pd.concat([df_full, df_new], ignore_index=True)

print(f"Full load records: {len(df_full)}")
print(f"Incremental load records: {len(df_new)}")
print(f"Combined dataset records: {len(df_combined)}")

print("\nCombined dataset:")
print(df_combined)

# Check data types and basic info
print("\nDataset Info:")
# info() prints its report directly and returns None, so it is not wrapped in print()
df_combined.info()
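
`pd.concat` aligns columns by name rather than by position, which is why a shared schema matters. The small sketch below shows what happens when the column order differs and one source carries an extra column; the `channel` column is a hypothetical addition used only for illustration.

```python
import pandas as pd

a = pd.DataFrame({"transaction_id": ["TXN001"], "price": [999.99]})

# Same columns in a different order, plus one hypothetical extra column.
b = pd.DataFrame({
    "price": [79.99],
    "transaction_id": ["TXN008"],
    "channel": ["web"],
})

combined = pd.concat([a, b], ignore_index=True)

# Columns are matched by name; rows from `a` get a missing value for `channel`.
print(combined)
print(f"Missing values per column:\n{combined.isna().sum()}")
```

A quick `isna().sum()` after concatenation is a cheap way to notice that the sources did not line up as expected.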

## TASK 4: Remove Duplicates and Apply Transformations

Let's clean the data and apply some basic transformations and aggregations.

### Step 4.1: Check for and Remove Duplicates

In [None]:
# Check for duplicate transactions
print("Checking for duplicates...")
print(f"Total records before deduplication: {len(df_combined)}")

# Find every row whose transaction_id appears more than once (keep=False flags all copies)
duplicates = df_combined[df_combined.duplicated(subset=['transaction_id'], keep=False)]
print(f"Rows sharing a transaction_id: {len(duplicates)}")

if len(duplicates) > 0:
    print("\nDuplicate records:")
    print(duplicates)

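# Remove duplicates based on transaction_id. The .copy() returns an independent DataFrame,
# so the column assignments in Step 4.2 cannot trigger a SettingWithCopyWarning
# (which some pandas versions raise when assigning to a derived DataFrame).
df_clean = df_combined.drop_duplicates(subset=['transaction_id'], keep='first').copy()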
print(f"\nRecords after deduplication: {len(df_clean)}")
print(f"Removed {len(df_combined) - len(df_clean)} duplicate records")
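
The sample files contain no overlapping transaction IDs, so the deduplication step above removes nothing. To see why the `keep` argument matters, here is a sketch with a hypothetical incremental batch that re-sends `TXN002` with a corrected quantity. Whether the older or newer record should win is a business decision; this example only shows the mechanics.

```python
import pandas as pd

existing = pd.DataFrame({
    "transaction_id": ["TXN001", "TXN002"],
    "quantity": [1, 2],
})

# Hypothetical incremental batch: TXN002 arrives again with a corrected quantity.
incoming = pd.DataFrame({
    "transaction_id": ["TXN002", "TXN003"],
    "quantity": [5, 1],
})

# Existing rows come first, so "first" means the old record and "last" the new one.
combined = pd.concat([existing, incoming], ignore_index=True)

keep_first = combined.drop_duplicates(subset=["transaction_id"], keep="first")
keep_last = combined.drop_duplicates(subset=["transaction_id"], keep="last")

print("keep='first' (original record wins):")
print(keep_first)
print("\nkeep='last' (incremental record wins):")
print(keep_last)
```

With `keep='last'`, the concatenation order acts as the tie-breaker, so the incremental data must be appended after the existing data for the newer record to survive.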

### Step 4.2: Add Calculated Columns

In [None]:
# Add calculated columns
print("Adding calculated columns...")

# Calculate total amount for each transaction
df_clean['total_amount'] = df_clean['quantity'] * df_clean['price']

# Convert sale_date to datetime
df_clean['sale_date'] = pd.to_datetime(df_clean['sale_date'])

# Add month column
df_clean['sale_month'] = df_clean['sale_date'].dt.strftime('%Y-%m')

print("\nDataset with calculated columns:")
print(df_clean[['transaction_id', 'product_name', 'quantity', 'price', 'total_amount', 'sale_month']].head())
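
As a quick sanity check on the calculated columns, the sketch below recomputes them for two of the sample transactions. Rounding `total_amount` to two decimals is an optional addition, not part of the step above: floating-point multiplication can leave tiny residue in currency values.

```python
import pandas as pd

# Two rows taken from the sample data: TXN002 (CSV) and TXN010 (JSON).
sample = pd.DataFrame({
    "transaction_id": ["TXN002", "TXN010"],
    "quantity": [2, 3],
    "price": [25.50, 15.99],
    "sale_date": ["2024-01-01", "2024-01-06"],
})

# quantity * price, rounded to cents to hide floating-point residue.
sample["total_amount"] = (sample["quantity"] * sample["price"]).round(2)

# Parse the date strings, then derive a year-month label.
sample["sale_date"] = pd.to_datetime(sample["sale_date"])
sample["sale_month"] = sample["sale_date"].dt.strftime("%Y-%m")

print(sample[["transaction_id", "quantity", "price", "total_amount", "sale_month"]])
```

For exact currency arithmetic, storing amounts as integer cents or using `decimal.Decimal` are common alternatives to rounding floats.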

### Step 4.3: Apply Aggregations

In [None]:
# Group by customer and calculate totals
print("=== CUSTOMER ANALYSIS ===")
customer_summary = df_clean.groupby('customer_id').agg({
    'total_amount': ['sum', 'count', 'mean'],
    'quantity': 'sum'
}).round(2)

# Flatten column names
customer_summary.columns = ['total_spent', 'num_transactions', 'avg_transaction', 'total_items']
customer_summary = customer_summary.reset_index()

print("Customer Summary:")
print(customer_summary)

print("\n=== PRODUCT ANALYSIS ===")
product_summary = df_clean.groupby('product_name').agg({
    'quantity': 'sum',
    'total_amount': 'sum',
    'transaction_id': 'count'
}).round(2)

product_summary.columns = ['total_quantity_sold', 'total_revenue', 'num_sales']
product_summary = product_summary.reset_index().sort_values('total_revenue', ascending=False)

print("Product Summary (sorted by revenue):")
print(product_summary)

print("\n=== OVERALL SUMMARY ===")
print(f"Total Revenue: ${df_clean['total_amount'].sum():.2f}")
print(f"Total Transactions: {len(df_clean)}")
print(f"Average Transaction Value: ${df_clean['total_amount'].mean():.2f}")
print(f"Total Items Sold: {df_clean['quantity'].sum()}")
print(f"Unique Customers: {df_clean['customer_id'].nunique()}")
print(f"Unique Products: {df_clean['product_name'].nunique()}")
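
The customer summary above assigns flattened column names by position, which silently breaks if the aggregation order changes. One alternative is pandas named aggregation, where each output column is declared next to its source column and function. The sketch below uses a three-row subset of the sample data so that it runs on its own.

```python
import pandas as pd

# Small subset of the sample sales data.
sales = pd.DataFrame({
    "transaction_id": ["TXN001", "TXN002", "TXN003"],
    "customer_id": ["CUST001", "CUST002", "CUST001"],
    "quantity": [1, 2, 1],
    "price": [999.99, 25.50, 75.00],
})
sales["total_amount"] = sales["quantity"] * sales["price"]

# Named aggregation: output_name=(source_column, aggregation_function)
customer_summary = (
    sales.groupby("customer_id")
    .agg(
        total_spent=("total_amount", "sum"),
        num_transactions=("total_amount", "count"),
        avg_transaction=("total_amount", "mean"),
        total_items=("quantity", "sum"),
    )
    .round(2)
    .reset_index()
)

print(customer_summary)
```

This produces the same four summary columns as the dictionary form, without a separate renaming step.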

## Summary

### What You Learned:

**Full Load vs Incremental Load:**
- **Full Load**: Process the complete dataset from scratch (here, the full history in `sales_full.csv`)
- **Incremental Load**: Process only the data added since the last run (here, the new transactions in `sales_new.json`)

**Key Skills:**
- Loading CSV and JSON files with pandas
- Merging data from different sources with same schema
- Removing duplicates to ensure data quality
- Adding calculated columns and transformations
- Creating aggregations and summaries

**Real-World Applications:**
- Daily sales data processing
- Customer analytics
- Product performance analysis
- Data quality management

### Key Takeaways:
1. **Full Load** suits initial loads and cases where everything must be reprocessed
2. **Incremental Load** is more efficient for regular updates, because each run handles only the new data
3. **Same schema** makes it easy to merge data from different file formats
4. **Data cleaning** (removing duplicates) is essential for accurate analysis
5. **Pandas** loads, combines, and aggregates tabular data in a few lines, which takes far more code with basic Python file operations

Great job completing this case study! You now understand fundamental data loading patterns used in real data engineering projects.