# Data Preprocessing for Cats vs Dogs - Google Colab

This notebook downloads and preprocesses the Cats vs Dogs dataset, then pushes the processed data to a DVC remote.

**Run this notebook ONCE to prepare your data, then use the training notebook.**

## What this notebook does:
1. Downloads raw Cats vs Dogs dataset (~800MB)
2. Preprocesses images (resize to 150x150, normalize)
3. Splits into train/val/test sets (70%/15%/15%)
4. Saves processed data as .npy files
5. Pushes to DVC remote for reuse
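
Steps 2 and 4 determine how much memory and disk space the processed arrays need. The arithmetic can be sketched as below. The helper name `estimate_array_bytes`, the float32 pixel type, and the 10,000-image count are assumptions made for illustration; they are not taken from the repository.

```python
def estimate_array_bytes(num_images, height=150, width=150, channels=3, bytes_per_value=4):
    """Return the size in bytes of a dense image array.

    bytes_per_value=4 assumes float32 pixels; use 8 for float64 or 1 for uint8.
    """
    return num_images * height * width * channels * bytes_per_value


# Hypothetical image count, chosen only to make the arithmetic easy to follow.
size = estimate_array_bytes(10_000)
print(f"{size / 1e9:.1f} GB")  # 10,000 * 150 * 150 * 3 * 4 bytes = 2.7 GB
```

Substitute the real image count and dtype to estimate the upload size before pushing.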

## Prerequisites:
- Google account
- Access to your GitHub repository
- DVC remote credentials (Backblaze B2 or S3)

**⏱️ Expected time**: 15-20 minutes

## 1. Clone Repository

In [None]:
import os

# Set your GitHub username and repo name
GITHUB_USERNAME = "bigalex95"  # Change this to your username
REPO_NAME = "are-you-a-cat-mlops-pipeline"
REPO_URL = f"https://github.com/{GITHUB_USERNAME}/{REPO_NAME}.git"

# Remove if already exists
if os.path.exists(REPO_NAME):
    !rm -rf {REPO_NAME}

# Clone the repository
!git clone {REPO_URL}

# Change to repository directory
%cd {REPO_NAME}

print("\n✅ Repository cloned!")

## 2. Install Dependencies

In [None]:
# Install required packages
!pip install -q tensorflow tensorflow-datasets
# The "s3" extra installs the DVC plugin needed for S3-compatible remotes
!pip install -q "dvc[s3]" boto3 s3fs
!pip install -q numpy pillow

print("\n✅ All dependencies installed!")

## 3. Configure DVC Remote

Set up your DVC credentials to push the processed data.

**Security Note:** Store credentials in Colab secrets rather than hard-coding them in a cell, so they are never saved with the notebook.
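
An empty secret or an empty typed value would otherwise surface only later, as an authentication error from DVC. A minimal sketch of failing fast is shown here; the helper `export_credentials` is hypothetical and not part of the repository, and the demo writes to a plain dict rather than the real environment.

```python
import os


def export_credentials(key_id, secret, env=os.environ):
    """Write S3-style credentials into env, rejecting empty or missing values."""
    values = {"AWS_ACCESS_KEY_ID": key_id, "AWS_SECRET_ACCESS_KEY": secret}
    # Validate everything first so a bad value never leaves env half-updated.
    for name, value in values.items():
        if not value or not value.strip():
            raise ValueError(f"{name} is empty; check the Colab secret or typed input")
    for name, value in values.items():
        env[name] = value.strip()


# Usage with a plain dict, so nothing real is touched in this example.
demo_env = {}
export_credentials("example-key-id", "example-secret", env=demo_env)
print(sorted(demo_env))
```

Calling `export_credentials(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)` without the `env` argument would write to `os.environ`.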

In [None]:
import os
from getpass import getpass

# Option 1: Use Colab secrets (recommended)
try:
    from google.colab import userdata
    AWS_ACCESS_KEY_ID = userdata.get('AWS_ACCESS_KEY_ID')
    AWS_SECRET_ACCESS_KEY = userdata.get('AWS_SECRET_ACCESS_KEY')
    print("✅ Using credentials from Colab secrets")
except Exception:  # not running in Colab, or the secret is missing or inaccessible
    # Option 2: Enter credentials manually
    print("Enter your DVC remote credentials (Backblaze B2 or S3):")
    AWS_ACCESS_KEY_ID = getpass("Access Key ID: ")
    AWS_SECRET_ACCESS_KEY = getpass("Secret Access Key: ")

# Set environment variables for DVC
os.environ['AWS_ACCESS_KEY_ID'] = AWS_ACCESS_KEY_ID
os.environ['AWS_SECRET_ACCESS_KEY'] = AWS_SECRET_ACCESS_KEY

print("\n✅ DVC credentials configured!")

## 4. Check Existing Data

Let's check if we already have the data before downloading/processing.

In [None]:
import os
import subprocess

# Check if processed data exists locally
PROCESSED_DIR = 'data/processed'
PROCESSED_FILES = [
    'train_images.npy', 'train_labels.npy',
    'val_images.npy', 'val_labels.npy',
    'test_images.npy', 'test_labels.npy'
]

processed_exists_local = all(
    os.path.exists(os.path.join(PROCESSED_DIR, f)) for f in PROCESSED_FILES
)

# Check if raw data exists locally
RAW_DATA_DIR = 'data/raw/cats_vs_dogs/4.0.1'
raw_exists_local = os.path.exists(RAW_DATA_DIR) and len(os.listdir(RAW_DATA_DIR)) > 0

# Check if data exists in DVC remote
dvc_file_exists = os.path.exists('data/processed.dvc')
processed_exists_remote = False

if dvc_file_exists:
    try:
        # `dvc status --cloud` compares the local cache with the remote. A plain
        # `dvc status` only compares the workspace with the cache, so it cannot
        # tell us whether the remote holds the data.
        result = subprocess.run(
            ['dvc', 'status', '--cloud', 'data/processed.dvc'],
            capture_output=True,
            text=True,
            check=False
        )
        output = result.stdout.lower()
        if result.returncode != 0 or 'missing:' in output or 'new:' in output:
            # Command failed, or some files are absent from the remote
            processed_exists_remote = False
        else:
            # "in sync": cache and remote match
            # "deleted:": files are in the remote but not in the local cache
            processed_exists_remote = 'in sync' in output or 'deleted:' in output
    except Exception as e:
        print(f"Note: Could not check DVC remote status: {e}")
        processed_exists_remote = False

print("=" * 80)
print("DATA STATUS CHECK")
print("=" * 80)

print(f"\n📁 Raw Data Status (Local):")
if raw_exists_local:
    print("  ✅ Raw data exists in data/raw/")
    print(f"  📊 Files found: {len(os.listdir(RAW_DATA_DIR))} files")
else:
    print("  ❌ Raw data not found - will download (~800MB)")

print(f"\n📁 Processed Data Status:")
print(f"\n  Local:")
if processed_exists_local:
    print("    ✅ Processed data exists in data/processed/")
    # Show file sizes
    !ls -lh data/processed/*.npy 2>/dev/null || echo "    (Files exist but couldn't list)"
else:
    print("    ❌ Processed data not found locally")

print(f"\n  DVC Remote:")
if dvc_file_exists:
    if processed_exists_remote:
        print("    ✅ Processed data exists in DVC remote storage")
        if not processed_exists_local:
            print("    💡 You can pull it with: dvc pull data/processed.dvc")
    else:
        print("    ❌ Processed data not found in DVC remote")
else:
    print("    ⚠️  No DVC tracking file (data/processed.dvc) found")

print("\n" + "=" * 80)

# Decision flags
NEED_DOWNLOAD = not raw_exists_local and not processed_exists_local
NEED_PROCESS = not processed_exists_local
CAN_PULL_FROM_DVC = dvc_file_exists and processed_exists_remote and not processed_exists_local

if processed_exists_local:
    print("🎉 Processed data exists locally! Skip to step 7 to verify it, then steps 9-10 to track and push it with DVC.")
elif CAN_PULL_FROM_DVC:
    print("Processed data exists in DVC remote but not locally.")
    print("   You can pull it (faster) or reprocess from raw data.")
    print("   Recommendation: Pull from DVC (run next cell to pull)")
elif not NEED_DOWNLOAD and NEED_PROCESS:
    print("📝 Raw data exists locally, but needs processing.")
    print("   Will skip download and only preprocess.")
elif NEED_DOWNLOAD and NEED_PROCESS:
    print("📥 Will download raw data and preprocess it.")
else:
    print("⚠️  Unusual state detected. Review the status above.")

print("=" * 80)

## 5. Pull from DVC Remote (if available)

If processed data exists in DVC remote, pull it instead of reprocessing.
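
A `!dvc pull` line does not tell the rest of the cell whether the command succeeded. One way to capture that, sketched under the assumption that `dvc` is on the PATH, is to call it through `subprocess` and inspect the exit code. The wrapper name `dvc_pull` and its injectable `run` parameter are hypothetical, added here so the logic can be exercised without a real remote.

```python
import subprocess


def dvc_pull(target, run=subprocess.run):
    """Run `dvc pull <target>` and return True only if the command exits with 0."""
    result = run(["dvc", "pull", target], capture_output=True, text=True, check=False)
    if result.returncode != 0:
        # Surface DVC's own error message instead of failing silently.
        print(f"dvc pull failed:\n{result.stderr}")
    return result.returncode == 0
```

In the notebook this could be used as `pulled = dvc_pull("data/processed.dvc")`, followed by the same file-existence check the cell below performs.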

In [None]:
# Try to pull from DVC if data exists in remote but not locally
if CAN_PULL_FROM_DVC:
    print("=" * 80)
    print("📥 PULLING PROCESSED DATA FROM DVC REMOTE")
    print("=" * 80)
    print("\nThis is much faster than reprocessing!")
    print("Downloading processed data...\n")
    
    !dvc pull data/processed.dvc
    
    # Verify the pull was successful
    if all(os.path.exists(os.path.join(PROCESSED_DIR, f)) for f in PROCESSED_FILES):
        print("\n✅ Successfully pulled processed data from DVC remote!")
        processed_exists_local = True
        NEED_PROCESS = False
        NEED_DOWNLOAD = False
    else:
        print("\n⚠️ Pull completed but some files are missing. Will need to reprocess.")
        NEED_PROCESS = True
else:
    if processed_exists_local:
        print("⏭️  Processed data already exists locally. Skipping DVC pull.")
    else:
        print("⏭️  No data in DVC remote. Will process from raw data or download.")

## 6. Download and Preprocess Data (if needed)

This will:
- Download the Cats vs Dogs dataset (~800MB) - **only if raw data doesn't exist**
- Resize all images to 150x150
- Normalize pixel values to [0, 1]
- Split into train (70%), validation (15%), test (15%)
- Save as .npy files

**⏱️ This takes 10-15 minutes** (or skip if data exists locally or in DVC)
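
The normalization and 70/15/15 split listed above can be sketched with NumPy as follows. This is an illustration only: the repository's `load_dataset` may shuffle, round, or order things differently, resizing is omitted, and the function name `normalize_and_split`, the seed, and the synthetic input are all assumptions.

```python
import numpy as np


def normalize_and_split(images, labels, train_frac=0.70, val_frac=0.15, seed=42):
    """Scale uint8 pixels to [0, 1] and split into train/val/test sets.

    The test set receives whatever remains after the train and val slices.
    """
    images = images.astype(np.float32) / 255.0
    # Shuffle once so all three splits draw from the same distribution.
    order = np.random.default_rng(seed).permutation(len(images))
    n_train = int(len(images) * train_frac)
    n_val = int(len(images) * val_frac)
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]
    return tuple((images[idx], labels[idx]) for idx in (train_idx, val_idx, test_idx))


# Synthetic stand-in for already-resized images: 100 random 150x150 RGB arrays.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(100, 150, 150, 3), dtype=np.uint8)
fake_labels = rng.integers(0, 2, size=100)
train, val, test = normalize_and_split(fake_images, fake_labels)
print(len(train[0]), len(val[0]), len(test[0]))
```

Fixing the seed keeps the split reproducible, which matters when the resulting files are versioned with DVC.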

In [None]:
import sys
sys.path.append('src')

# Only process if needed
if NEED_PROCESS:
    from data_loader import load_dataset
    
    print("Starting data preprocessing...\n")
    print("This will:")
    if NEED_DOWNLOAD:
        print("  1. Download raw Cats vs Dogs dataset (~800MB)")
    else:
        print("  1. ✅ Using existing raw data (skipping download)")
    print("  2. Preprocess and resize images to 150x150")
    print("  3. Split into train/val/test sets (70%/15%/15%)")
    print("  4. Save processed data as .npy files")
    print("\n⏱️ This may take 10-15 minutes...\n")
    print("="*80)
    
    # Load, preprocess, split, and save data
    # This function handles everything automatically
    train_data, val_data, test_data = load_dataset(split='train')
    
    print("\n" + "="*80)
    print("\n✅ Data preprocessing completed!")
else:
    print("="*80)
    print("⏭️  SKIPPING: Processed data already exists!")
    print("="*80)
    print("\nLoading existing processed data for verification...")
    
    from preprocess import load_processed_data
    train_data, val_data, test_data = load_processed_data()
    
    print("✅ Existing data loaded successfully!")

## 7. Verify Processed Data

Let's check what was created or loaded.
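
Printed shapes and ranges still have to be read by eye. A hedged sketch of turning the same checks into explicit tests is shown here; the helper `check_split` is hypothetical, and the expected layout (150x150 RGB, pixels in [0, 1], labels 0 for cat and 1 for dog) is inferred from the steps in this notebook.

```python
import numpy as np


def check_split(images, labels, size=150):
    """Return a list of problems found in one (images, labels) split; empty means OK."""
    problems = []
    if images.ndim != 4 or images.shape[1:] != (size, size, 3):
        problems.append(f"unexpected image shape {images.shape}")
    if len(images) != len(labels):
        problems.append(f"{len(images)} images but {len(labels)} labels")
    if images.size and (images.min() < 0.0 or images.max() > 1.0):
        problems.append("pixel values fall outside [0, 1]")
    if not np.isin(labels, (0, 1)).all():
        problems.append("labels contain values other than 0 and 1")
    return problems


# A well-formed split reports no problems.
good = np.zeros((4, 150, 150, 3), dtype=np.float32)
print(check_split(good, np.array([0, 1, 1, 0])))
```

Running `check_split(*train_data)`, `check_split(*val_data)`, and `check_split(*test_data)` after the cell below would flag a bad split before it is pushed.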

In [None]:
import numpy as np

print("Processed data files:")
!ls -lh data/processed/

# Load and verify data shapes
X_train, y_train = train_data
X_val, y_val = val_data
X_test, y_test = test_data

print(f"\nData shapes:")
print(f"  Training:   {X_train.shape} images, {y_train.shape} labels")
print(f"  Validation: {X_val.shape} images, {y_val.shape} labels")
print(f"  Test:       {X_test.shape} images, {y_test.shape} labels")

print(f"\nClass distribution:")
print(f"  Training:   {np.sum(y_train == 0)} cats, {np.sum(y_train == 1)} dogs")
print(f"  Validation: {np.sum(y_val == 0)} cats, {np.sum(y_val == 1)} dogs")
print(f"  Test:       {np.sum(y_test == 0)} cats, {np.sum(y_test == 1)} dogs")

print(f"\nValue ranges:")
print(f"  Min: {X_train.min():.3f}")
print(f"  Max: {X_train.max():.3f}")

print("\n✅ Data verification complete!")

## 8. Visualize Sample Images

Let's verify the preprocessing worked correctly.

In [None]:
import matplotlib.pyplot as plt

# Visualize some training images
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.ravel()

for i in range(10):
    axes[i].imshow(X_train[i])
    label = "Dog" if y_train[i] == 1 else "Cat"
    axes[i].set_title(f"{label}")
    axes[i].axis('off')

plt.tight_layout()
plt.show()

print("✅ Sample images displayed. Confirm that each label matches its image.")

## 9. Add to DVC (if not already tracked)

Track the processed data directory with DVC.

In [None]:
import os

# Check if already tracked by DVC
if os.path.exists('data/processed.dvc'):
    print("✅ Data already tracked by DVC!")
    print("\nExisting DVC file:")
    !ls -lh data/processed.dvc
    
    # Check if data changed
    print("\nChecking if data changed...")
    !dvc status data/processed.dvc
else:
    # Add processed data directory to DVC
    print("Adding processed data to DVC...\n")
    !dvc add data/processed
    
    print("\n✅ Data added to DVC!")
    print("\nCreated files:")
    !ls -lh data/processed.dvc data/.gitignore

## 10. Push to DVC Remote

Upload the processed data to your remote storage (if needed).

In [None]:
# Check if we need to push to remote
if processed_exists_remote and not NEED_PROCESS:
    print("=" * 80)
    print("✅ Data already exists in DVC remote - skipping push")
    print("=" * 80)
else:
    print("Pushing processed data to DVC remote...")
    print("This may take a few minutes (~1-2GB upload)\n")
    
    !dvc push data/processed.dvc
    
    print("\n" + "="*80)
    print("✅ Processed data successfully pushed to DVC remote!")
    print("="*80)

## 11. Commit to Git

Commit the DVC metadata files to your repository.
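
If `data/processed.dvc` has not changed, `git commit` exits with an error because there is nothing to commit. A minimal sketch that commits only when something is staged is shown here, relying on `git diff --cached --quiet` exiting with 0 when the index matches HEAD and 1 when it differs. The helper `commit_if_staged` and its injectable `run` parameter are hypothetical.

```python
import subprocess


def commit_if_staged(message, run=subprocess.run):
    """Commit staged changes; return False when nothing is staged."""
    # Exit code 0 means the index matches HEAD, so there is nothing to commit.
    if run(["git", "diff", "--cached", "--quiet"], check=False).returncode == 0:
        print("Nothing staged; skipping commit.")
        return False
    run(["git", "commit", "-m", message], check=True)
    return True
```

In the cell below, `commit_if_staged("Add preprocessed data to DVC")` could replace the `!git commit` line, after `git add` has run.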

In [None]:
# Configure git (replace with your info)
!git config --global user.email "your.email@example.com"
!git config --global user.name "Your Name"

# Add DVC files
!git add data/processed.dvc data/.gitignore

# Commit changes
!git commit -m "Add preprocessed data to DVC"

print("\n✅ Changes committed!")
print("\nTo push to GitHub, authenticate and run:")
print("  git push origin model-development")

## 12. Summary

### ✅ What we accomplished:
1. ✅ Checked for existing data (local and DVC remote)
2. ✅ Pulled processed data from the DVC remote (if available), or
3. ✅ Downloaded the Cats vs Dogs dataset (if needed) and
4. ✅ Preprocessed the images (if needed)
5. ✅ Verified data quality
6. ✅ Tracked with DVC (if not already)
7. ✅ Pushed to DVC remote (if needed)
8. ✅ Committed metadata to git

### 📊 Processed Data:
- **Location**: `data/processed/`
- **Files**: train_images.npy, train_labels.npy, val_images.npy, val_labels.npy, test_images.npy, test_labels.npy
- **DVC file**: `data/processed.dvc`

### 🎯 Next Steps:
1. Push the commit to GitHub (step 11 only commits locally)
2. **Use the `colab_model_training.ipynb` notebook to train your model**
3. The training notebook will pull this preprocessed data automatically

### 💡 Smart Features:
- ✨ **Checks DVC remote** before downloading/processing
- ✨ **Pulls from DVC** if data exists (much faster than reprocessing)
- ✨ **Skips unnecessary steps** automatically
- ✨ **Handles all scenarios**:
  - Data exists in DVC remote → Pull it
  - Data exists locally → Skip download, pull, and preprocessing
  - Raw data exists → Skip download, only preprocess
  - Nothing exists → Download and preprocess
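
The scenarios above reduce to a small priority-ordered decision. The sketch below mirrors the order of the flags set in step 4, where local processed data takes precedence over the remote; the function name `choose_action` and the returned action strings are illustrative and do not appear in the notebook's code.

```python
from itertools import product


def choose_action(processed_local, processed_remote, raw_local):
    """Map the three existence checks to one action, in priority order."""
    if processed_local:
        return "verify"
    if processed_remote:
        return "pull"
    if raw_local:
        return "preprocess"
    return "download_and_preprocess"


# Enumerate every combination to confirm each state maps to exactly one action.
for flags in product((True, False), repeat=3):
    print(flags, "->", choose_action(*flags))
```

Keeping the decision in one pure function makes each of the eight possible states easy to inspect.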

### 📊 Typical Scenarios:

**First Time (Nothing exists):**
- ⏱️ 15-20 minutes: Download + Preprocess + Push

**Second Time (Data in DVC):**
- ⏱️ 2-3 minutes: Pull from DVC

**Local Data Exists:**
- ⏱️ 30 seconds: Verify and update DVC if needed

---

**You're all set! Now use the training notebook to train your model. 🚀**