# Data Download and Preparation - Criteo CTR Dataset

This notebook downloads the Criteo CTR dataset from Kaggle, extracts it into `data/raw/`, and checks that a sample loads with Spark.

## Dataset Information:
- **Name**: Criteo Click-Through Rate Prediction
- **Source**: Kaggle
- **Size**: ~45M training rows, ~10GB uncompressed
- **URL**: https://www.kaggle.com/c/criteo-display-ad-challenge

## Steps:
1. Download dataset from Kaggle
2. Extract and place in `data/raw/`
3. Create 1M row sample for testing
4. Verify data loading

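## Option 1: Manual Download (Recommended for First Time)

Sign in to Kaggle, accept the competition rules on the competition page, download the archive in your browser, and extract it into `data/raw/`. Steps 2-4 below set up the Kaggle API, which is only needed for Option 2.

### Steps: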

1. **Create Kaggle Account**:
   - Go to https://www.kaggle.com/
   - Sign up or log in

2. **Get API Credentials**:
   - Go to https://www.kaggle.com/settings/account
   - Scroll to "API" section
   - Click "Create New API Token"
   - This downloads `kaggle.json` to your Downloads folder

3. **Setup Kaggle API**:
   ```bash
   # Create .kaggle directory
   mkdir -p ~/.kaggle
   
   # Move kaggle.json (replace path if different)
   mv ~/Downloads/kaggle.json ~/.kaggle/
   
   # Set permissions
   chmod 600 ~/.kaggle/kaggle.json
   ```

4. **Install Kaggle CLI**:
   ```bash
   pip install kaggle
   ```
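Before moving on to the API download, it can help to confirm that the token file is where the CLI expects it. The following is a minimal sketch, assuming the `kaggle.json` layout with a `username` and a `key` entry and the default `~/.kaggle/` location used in the steps above. `check_kaggle_credentials` is a hypothetical helper, not part of the Kaggle CLI.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path=None):
    """Return a list of problems found with a kaggle.json file (empty if none).

    Hypothetical helper for this notebook; it only inspects the local file
    and never contacts Kaggle.
    """
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []

    # On POSIX systems the token should be accessible by the owner only (chmod 600).
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append(f"permissions are {oct(mode)}; run: chmod 600 {path}")

    # Assumption: the token file is a JSON object with "username" and "key".
    try:
        token = json.loads(path.read_text())
    except (OSError, ValueError) as exc:
        return problems + [f"could not parse {path}: {exc}"]
    if not isinstance(token, dict):
        return problems + [f"{path} does not contain a JSON object"]

    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing '{field}' entry")
    return problems


problems = check_kaggle_credentials()
print("Credentials look fine." if not problems else "\n".join(problems))
```

An empty result only means the file is present, private, and well-formed. It does not prove that the token is still valid on Kaggle's side.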

## Option 2: Download via Kaggle API (After Setup)

In [None]:
# Install kaggle if not already installed.
# %pip installs into the environment of the running kernel.
%pip install kaggle

In [None]:
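# Download Criteo dataset (requires the API credentials from Option 1
# and that you have accepted the competition rules on Kaggle)
!kaggle competitions download -c criteo-display-ad-challenge -p ../data/raw/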

In [None]:
# Extract the dataset
import zipfile
import os

zip_path = '../data/raw/criteo-display-ad-challenge.zip'
extract_path = '../data/raw/'

if os.path.exists(zip_path):
    print(f"Extracting {zip_path}...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)
    print("Extraction complete!")
    
    # List extracted files
    print("\nExtracted files:")
    for file in os.listdir(extract_path):
        file_path = os.path.join(extract_path, file)
        if os.path.isfile(file_path):
            size_mb = os.path.getsize(file_path) / (1024 * 1024)
            print(f"  - {file} ({size_mb:.2f} MB)")
else:
    print(f"ZIP file not found: {zip_path}")
    print("Please download manually from Kaggle.")

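Step 3 of the overview calls for a 1M row sample of the raw data. One way to sketch it without loading the full file into memory is reservoir sampling (Algorithm R) over the extracted text file, using only the standard library. The input name `train.txt` and the output path `train_1m.txt` are assumptions, so adjust them to the files listed by the extraction cell; `reservoir_sample_lines` is a hypothetical helper.

```python
import os
import random


def reservoir_sample_lines(src_path, dst_path, n_rows=1_000_000, seed=42):
    """Write a uniform random sample of up to n_rows lines from src_path to dst_path.

    Uses reservoir sampling (Algorithm R): a single pass over the file, keeping
    only the sampled lines in memory. Returns the number of lines written.
    """
    rng = random.Random(seed)
    reservoir = []  # (line_index, line) pairs
    with open(src_path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i < n_rows:
                # Fill the reservoir with the first n_rows lines.
                reservoir.append((i, line))
            else:
                # Line i replaces a random slot with probability n_rows / (i + 1).
                j = rng.randint(0, i)
                if j < n_rows:
                    reservoir[j] = (i, line)

    reservoir.sort()  # restore the original file order
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(line for _, line in reservoir)
    return len(reservoir)


raw_path = "../data/raw/train.txt"       # assumed name of the extracted training file
sample_txt = "../data/raw/train_1m.txt"  # hypothetical output path
if os.path.exists(raw_path):
    written = reservoir_sample_lines(raw_path, sample_txt)
    print(f"Wrote {written:,} rows to {sample_txt}")
else:
    print(f"Raw file not found: {raw_path}")
```

Taking the first 1M lines would be faster, but a uniform sample avoids any bias from the order in which the rows were logged. The fixed seed keeps the sample reproducible across runs.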
## Alternative: Use Smaller Dataset for Testing

If you don't have access to Kaggle or want to test quickly, you can generate synthetic data with the same 40-column layout (a `click` label, `I1`-`I13`, `C1`-`C26`):

In [None]:
import sys
sys.path.append('..')

from src.config import Config
from src.utils.spark_utils import create_spark_session
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import random

random.seed(42)  # Fixed seed so the synthetic sample is reproducible

# Create config
config = Config('../config/config.yaml')

# Create Spark session
spark = create_spark_session(
    app_name="SyntheticData",
    master="local[*]"
)

# Generate synthetic data
def generate_synthetic_data(n_rows=10000):
    """Generate synthetic Criteo-like data for testing."""
    print(f"Generating {n_rows:,} synthetic rows...")
    
    data = []
    for _ in range(n_rows):
        # click label with a ~3% positive rate
        row = [random.choices([0, 1], weights=[0.97, 0.03])[0]]

        # Numerical features I1-I13 (about 10% missing)
        row.extend(
            random.randint(0, 100) if random.random() > 0.1 else None
            for _ in range(13)
        )

        # Categorical features C1-C26 (about 5% missing)
        row.extend(
            f"cat_{random.randint(1, 1000)}" if random.random() > 0.05 else None
            for _ in range(26)
        )

        data.append(tuple(row))
    
    # Define schema
    fields = [StructField("click", IntegerType(), True)]
    for i in range(1, 14):
        fields.append(StructField(f"I{i}", IntegerType(), True))
    for i in range(1, 27):
        fields.append(StructField(f"C{i}", StringType(), True))
    
    schema = StructType(fields)
    
    # Create DataFrame
    df = spark.createDataFrame(data, schema)
    
    return df

# Generate data
synthetic_df = generate_synthetic_data(n_rows=10000)

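# Save as sample. The verification cell reads config['data']['sample_path'],
# so make sure that setting points at this directory.
sample_path = '../data/sample/'
synthetic_df.write.parquet(sample_path, mode='overwrite')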

print(f"\nSynthetic data saved to: {sample_path}")
print(f"Rows: {synthetic_df.count():,}")
synthetic_df.show(5)

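The synthetic schema mirrors the raw Criteo text files, which are tab-separated with no header row: a label, 13 integer features, then 26 categorical features, with missing values left as empty fields. As a minimal sketch of that mapping, the hypothetical helper below parses one raw training line into the column names used in this notebook (`click`, `I1`-`I13`, `C1`-`C26`). It is an illustration only and not the project's `CriteoDataLoader`.

```python
# Column names in the order they appear in a raw training line.
COLUMNS = ["click"] + [f"I{i}" for i in range(1, 14)] + [f"C{i}" for i in range(1, 27)]


def parse_criteo_line(line):
    """Parse one raw tab-separated training line into a dict keyed by COLUMNS."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")

    row = {}
    for idx, (name, value) in enumerate(zip(COLUMNS, fields)):
        if value == "":
            row[name] = None        # empty field means the value is missing
        elif idx <= 13:
            row[name] = int(value)  # label and I1-I13 are integers
        else:
            row[name] = value       # C1-C26 stay as strings
    return row


# Made-up example line: a click, I2 missing, all categoricals set to one token.
example = "\t".join(["1", "5", ""] + ["3"] * 11 + ["68fd1e64"] * 26) + "\n"
row = parse_criteo_line(example)
print(row["click"], row["I1"], row["I2"], row["C1"])
```

The width check is a cheap way to catch a wrong delimiter or a truncated line early, before the data reaches Spark.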
## Verify Data Loading

In [None]:
import sys
sys.path.append('..')

from src.config import Config
from src.utils.spark_utils import create_spark_session
from src.data.loader import CriteoDataLoader

# Load config
config = Config('../config/config.yaml')

# Create Spark session
spark = create_spark_session(
    app_name=config['spark']['app_name'],
    master=config['spark']['master']
)

# Initialize loader
loader = CriteoDataLoader(spark, config)

# Try loading sample
sample_path = config['data']['sample_path']
df = loader.load_parquet(sample_path)

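# Count once and reuse the result; each count() triggers a full Spark job
n_rows = df.count()

print("\nData loaded successfully!")
print(f"Rows: {n_rows:,}")
print(f"Columns: {len(df.columns)}")

# Show sample
df.show(5)

# Check click rate (guarding against an empty sample)
from pyspark.sql import functions as F
n_clicks = df.filter(F.col('click') == 1).count()
click_rate = n_clicks / n_rows if n_rows else 0.0
print(f"\nClick rate: {click_rate:.4f} ({click_rate*100:.2f}%)")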

In [None]:
# Stop Spark
spark.stop()
print("Spark session stopped.")

## Next Steps

1. ✅ Data downloaded and verified
2. ➡️ Run `02_eda.ipynb` for exploratory data analysis
3. ➡️ Run `03_baseline_model.ipynb` to train baseline model