# Local Spark with S3 Integration

This notebook demonstrates how to:
1. Load AWS credentials from a local JSON file
2. Configure a local Spark session for S3 access
3. Read sample clickstream data from S3
4. Validate the connection and explore the data

**Prerequisites:**
- Java 8 or 11 installed (`java -version`)
- PySpark installed (`pip install pyspark`)
- AWS credentials file (see `aws_credentials_template.json`)

## 1. Setup and Configuration

In [None]:
# Install required packages if needed
# !pip install pyspark boto3

In [None]:
import os
import json
from pathlib import Path

# Configuration - Update these paths for your environment
CREDENTIALS_FILE = Path.home() / '.aws' / 'aws_credentials.json'
# Alternative: use a path relative to this notebook
# CREDENTIALS_FILE = Path('./aws_credentials.json')

# S3 paths - Update for your buckets
S3_BRONZE_BUCKET = 'your-bronze-bucket-name'
S3_DATA_KEY = 'clickstream/sample_clickstream.json'

print(f"Credentials file: {CREDENTIALS_FILE}")
print(f"S3 path: s3a://{S3_BRONZE_BUCKET}/{S3_DATA_KEY}")

## 2. Load AWS Credentials from JSON File

**Expected JSON format (`aws_credentials.json`):**
```json
{
  "aws_access_key_id": "YOUR_ACCESS_KEY_ID",
  "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",
  "region": "eu-north-1",
  "session_token": null
}
```

In [None]:
def load_aws_credentials(credentials_path: Path) -> dict:
    """
    Load AWS credentials from a JSON file.
    
    Args:
        credentials_path: Path to the JSON credentials file
    
    Returns:
        Dictionary with AWS credentials
    
    Raises:
        FileNotFoundError: If credentials file doesn't exist
        ValueError: If required keys are missing
    """
    if not credentials_path.exists():
        raise FileNotFoundError(
            f"Credentials file not found: {credentials_path}\n"
            f"Create it using aws_credentials_template.json as a reference."
        )
    
    with open(credentials_path, 'r') as f:
        creds = json.load(f)
    
    required_keys = ['aws_access_key_id', 'aws_secret_access_key']
    missing = [k for k in required_keys if not creds.get(k)]
    
    if missing:
        raise ValueError(f"Missing required credentials: {missing}")
    
    print(f"✅ Loaded credentials for region: {creds.get('region', 'not specified')}")
    print(f"   Access Key ID: {creds['aws_access_key_id'][:8]}...")
    
    return creds

# Load credentials
try:
    aws_creds = load_aws_credentials(CREDENTIALS_FILE)
except FileNotFoundError as e:
    print(f"⚠️  {e}")
    print("\nFor testing, you can also use environment variables or ~/.aws/credentials")
    aws_creds = None
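
The fallback message above mentions environment variables. As a minimal sketch, the helper below builds a dictionary with the same keys as the JSON file from the standard AWS variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_DEFAULT_REGION`). The function name `credentials_from_env` is hypothetical and not part of this notebook.

```python
import os
from typing import Mapping, Optional


def credentials_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Build a credentials dict from environment variables (hypothetical helper).

    Returns None when the access key or secret key is not set.
    """
    # Allow a custom mapping to be passed in so the function is easy to test.
    env = os.environ if env is None else env
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    if not access_key or not secret_key:
        return None
    # Same keys as aws_credentials.json so the result can be used interchangeably.
    return {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region": env.get("AWS_DEFAULT_REGION"),
        "session_token": env.get("AWS_SESSION_TOKEN"),
    }
```

One possible use is `aws_creds = aws_creds or credentials_from_env()`, which keeps the JSON file as the first choice.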

## 3. Create Spark Session with S3 Configuration

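This configures Spark to use the Hadoop AWS library (`hadoop-aws`) and its S3A connector for S3 access. The `hadoop-aws` version should match the Hadoop version bundled with your PySpark installation; adjust the versions in `spark.jars.packages` if they differ.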

In [None]:
from typing import Optional

from pyspark.sql import SparkSession

def create_spark_session_with_s3(credentials: Optional[dict] = None) -> SparkSession:
    """
    Create a local Spark session configured for S3 access.
    
    Args:
        credentials: Dictionary with AWS credentials (optional if using IAM role or env vars)
    
    Returns:
        Configured SparkSession
    """
    builder = SparkSession.builder \
        .appName("LocalS3SparkSession") \
        .master("local[*]") \
        .config("spark.driver.memory", "2g") \
        .config("spark.sql.shuffle.partitions", "4") \
        .config("spark.jars.packages", 
                "org.apache.hadoop:hadoop-aws:3.3.4,"
                "com.amazonaws:aws-java-sdk-bundle:1.12.262") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    
    # Add credentials configuration if provided
    if credentials:
        builder = builder \
            .config("spark.hadoop.fs.s3a.access.key", credentials['aws_access_key_id']) \
            .config("spark.hadoop.fs.s3a.secret.key", credentials['aws_secret_access_key'])
        
        if credentials.get('session_token'):
            builder = builder \
                .config("spark.hadoop.fs.s3a.session.token", credentials['session_token']) \
                .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        
        if credentials.get('region'):
            builder = builder \
                .config("spark.hadoop.fs.s3a.endpoint", f"s3.{credentials['region']}.amazonaws.com")
    else:
        # Use default credential provider chain (env vars, ~/.aws/credentials, IAM role)
        builder = builder \
            .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                    "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    
    spark = builder.getOrCreate()
    
    print("✅ Spark session created")
    print(f"   Version: {spark.version}")
    print(f"   App ID: {spark.sparkContext.applicationId}")
    
    return spark

# Create Spark session
spark = create_spark_session_with_s3(aws_creds)
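
To see which S3A settings the function above would apply without starting Spark, the same branching can be sketched as a plain dictionary builder. The helper names `build_s3a_conf` and `redact` are hypothetical; the configuration keys are the ones used in the cell above.

```python
from typing import Optional


def build_s3a_conf(credentials: Optional[dict] = None) -> dict:
    """Return the spark.hadoop.fs.s3a.* settings as a dict (hypothetical helper)."""
    conf = {"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"}
    if not credentials:
        # No explicit credentials: defer to the AWS SDK default provider chain.
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = (
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
        )
        return conf
    conf["spark.hadoop.fs.s3a.access.key"] = credentials["aws_access_key_id"]
    conf["spark.hadoop.fs.s3a.secret.key"] = credentials["aws_secret_access_key"]
    if credentials.get("session_token"):
        # Temporary credentials need the session token and a matching provider.
        conf["spark.hadoop.fs.s3a.session.token"] = credentials["session_token"]
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = (
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
        )
    if credentials.get("region"):
        conf["spark.hadoop.fs.s3a.endpoint"] = f"s3.{credentials['region']}.amazonaws.com"
    return conf


def redact(conf: dict) -> dict:
    """Mask secret values so the settings can be printed safely."""
    secret_suffixes = ("secret.key", "session.token")
    return {k: ("***" if k.endswith(secret_suffixes) else v) for k, v in conf.items()}
```

Printing `redact(build_s3a_conf(aws_creds))` shows the effective settings without exposing the secret key. The same dictionary could also be applied in a loop with `builder.config(key, value)`.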

## 4. Read Sample Data from S3

Now let's read the sample clickstream data from your S3 bucket.

In [None]:
def read_json_from_s3(spark: SparkSession, bucket: str, key: str):
    """
    Read JSON data from S3 into a Spark DataFrame.
    
    Args:
        spark: Active SparkSession
        bucket: S3 bucket name
        key: S3 object key (path within bucket)
    
    Returns:
        Spark DataFrame
    """
    s3_path = f"s3a://{bucket}/{key}"
    
    print(f"📖 Reading from: {s3_path}")
    
    try:
        df = spark.read.json(s3_path)
        record_count = df.count()
        print(f"✅ Successfully read {record_count} records")
        return df
    except Exception as e:
        print(f"❌ Failed to read from S3: {e}")
        raise

# Read data from S3
# Uncomment once you have an actual S3 bucket configured
# df = read_json_from_s3(spark, S3_BRONZE_BUCKET, S3_DATA_KEY)
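
The reader above builds an `s3a://` path, which is the scheme handled by the S3A connector, while AWS tooling such as the console and CLI usually shows `s3://` URIs. A small sketch of converting between the two is shown below; the helper name `to_s3a_path` is hypothetical.

```python
def to_s3a_path(uri: str) -> str:
    """Convert an s3://, s3n:// or s3a:// URI to the s3a:// form (hypothetical helper)."""
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"Not an S3 URI: {uri}")
    # Everything up to the first slash is the bucket, the remainder is the key.
    bucket, _, key = rest.partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri}")
    return f"s3a://{bucket}/{key.lstrip('/')}"
```

With this, a URI copied from the AWS console can be passed to `spark.read.json(to_s3a_path(uri))`.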

## 5. Alternative: Read Local Sample Data

For testing without S3, you can read the local sample file:

In [None]:
# Path to local sample data (adjust as needed)
LOCAL_SAMPLE_PATH = '../EMR/Pyspark/sample_clickstream.json'

print(f"📖 Reading local file: {LOCAL_SAMPLE_PATH}")

df = spark.read.json(LOCAL_SAMPLE_PATH)

print(f"✅ Loaded {df.count()} records from local file")
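
If the sample file is not available, a small stand-in can be generated locally. By default `spark.read.json` expects JSON Lines (one JSON object per line), so the sketch below writes that format. The field names are taken from the exploration and validation cells in this notebook; the record values and the helper name `write_json_lines` are invented for illustration and are not the real sample data.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: field names match the columns used later in this
# notebook, but the values are made up.
SAMPLE_RECORDS = [
    {"sensor_id": 1, "temperature": 21.5, "humidity": 40.0,
     "status": "active", "event_type": "reading", "timestamp": 1700000000},
    {"sensor_id": 2, "temperature": 19.0, "humidity": 55.5,
     "status": "inactive", "event_type": "reading", "timestamp": 1700000060},
    {"sensor_id": 3, "temperature": 120.0, "humidity": 30.0,
     "status": "faulty", "event_type": "alert", "timestamp": 1700000120},
]


def write_json_lines(records, path: Path) -> Path:
    """Write one JSON object per line, the default layout for spark.read.json."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


sample_path = write_json_lines(
    SAMPLE_RECORDS, Path(tempfile.gettempdir()) / "sample_clickstream.jsonl"
)
print(f"Wrote {len(SAMPLE_RECORDS)} records to {sample_path}")
```

The file can then be loaded with `spark.read.json(str(sample_path))`. A file holding a single pretty-printed JSON array would need the reader's `multiLine` option instead.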

## 6. Validate Schema and Explore Data

In [None]:
# Display schema
print("📋 DataFrame Schema:")
df.printSchema()

In [None]:
# Show sample records
print("📊 Sample Records:")
df.show(truncate=False)

In [None]:
# Basic statistics
print("📈 Numeric Column Statistics:")
df.select('sensor_id', 'temperature', 'humidity').describe().show()

In [None]:
# Status distribution
print("🔍 Status Distribution:")
df.groupBy('status').count().show()

In [None]:
# Event type distribution
print("🔍 Event Type Distribution:")
df.groupBy('event_type').count().show()

## 7. Validate Data Quality

In [None]:
from pyspark.sql.functions import col

def validate_clickstream_data(df):
    """
    Validate clickstream data against expected quality rules.
    
    Returns a summary of validation results.
    """
    validations = {
        'sensor_id_positive': (col('sensor_id') > 0),
        'temp_in_range': (col('temperature').between(-50, 100)),
        'humidity_in_range': (col('humidity').between(0, 100)),
        'valid_status': col('status').isin('active', 'faulty', 'inactive'),
        'timestamp_positive': (col('timestamp') > 0)
    }
    
    total = df.count()
    print(f"📊 Data Quality Validation (Total Records: {total})")
    print("=" * 50)
    
    for rule_name, condition in validations.items():
        valid_count = df.filter(condition).count()
        pct = (valid_count / total) * 100 if total > 0 else 0
        status = "✅" if pct == 100 else "⚠️"
        print(f"{status} {rule_name}: {valid_count}/{total} ({pct:.1f}%)")
    
    print("=" * 50)

validate_clickstream_data(df)
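
The same rules can be expressed as plain Python predicates, which may be handy for unit-testing them on a few hand-written records without a Spark session. This is a sketch, not a replacement for the Spark version: the names `RULES` and `count_valid` are hypothetical, and missing or `None` values are treated as failures, mirroring how `df.filter` drops rows whose condition evaluates to null.

```python
from typing import Callable, Dict, Iterable


def _num(value) -> bool:
    """True for int/float values, excluding bool and None."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Plain-Python counterparts of the Spark column conditions above.
# Range checks are inclusive, like Column.between.
RULES: Dict[str, Callable[[dict], bool]] = {
    "sensor_id_positive": lambda r: _num(r.get("sensor_id")) and r["sensor_id"] > 0,
    "temp_in_range": lambda r: _num(r.get("temperature")) and -50 <= r["temperature"] <= 100,
    "humidity_in_range": lambda r: _num(r.get("humidity")) and 0 <= r["humidity"] <= 100,
    "valid_status": lambda r: r.get("status") in ("active", "faulty", "inactive"),
    "timestamp_positive": lambda r: _num(r.get("timestamp")) and r["timestamp"] > 0,
}


def count_valid(records: Iterable[dict]) -> Dict[str, int]:
    """Count, per rule, how many records satisfy it."""
    records = list(records)
    return {name: sum(1 for r in records if rule(r)) for name, rule in RULES.items()}
```

For a small DataFrame, `count_valid(row.asDict() for row in df.collect())` should give the same counts as the Spark validation.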

## 8. Test S3 Write (Optional)

If you want to test writing back to S3:

In [None]:
# Uncomment to test S3 write
# OUTPUT_PATH = f"s3a://{S3_BRONZE_BUCKET}/test_output/"
# 
# df.write \
#     .mode("overwrite") \
#     .parquet(OUTPUT_PATH)
# 
# print(f"✅ Data written to: {OUTPUT_PATH}")

## 9. Cleanup

In [None]:
# Stop Spark session when done
# spark.stop()
# print("✅ Spark session stopped")

---

## Troubleshooting

### Common Issues

1. **Java not found**: Install Java 8 or 11 and set `JAVA_HOME`
2. **S3 access denied**: Check your AWS credentials and bucket permissions
3. **Slow first run**: Spark downloads the required JARs on the first run and caches them for later runs
4. **Memory errors**: Increase `spark.driver.memory` in session config
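
The Java issue above can be checked before starting Spark. The following is a minimal preflight sketch using only the standard library; the function name `preflight_report` is hypothetical, and it only reports what it finds rather than verifying the Java version.

```python
import importlib.util
import os
import shutil


def preflight_report() -> dict:
    """Collect basic facts about the local environment (hypothetical helper)."""
    return {
        # Full path of the java executable, or None if it is not on PATH.
        "java_on_path": shutil.which("java"),
        # Value of JAVA_HOME, or None if it is not set.
        "java_home": os.environ.get("JAVA_HOME"),
        # True if the pyspark package can be imported in this environment.
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


for name, value in preflight_report().items():
    print(f"{name}: {value}")
```

A `None` for `java_on_path` or `java_home` points at the first issue in the list.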

### Credential Priority

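If credentials from the JSON file are passed to `create_spark_session_with_s3`, they are set explicitly in the Spark configuration. Otherwise the session falls back to the AWS SDK default provider chain, so the effective order is: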
1. Explicit configuration (from JSON file)
2. Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
3. AWS CLI config (`~/.aws/credentials`)
4. IAM instance role (on EC2)
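
To see which of these sources is likely to apply on a given machine, the order above can be mirrored in a small diagnostic. This is only a sketch under that assumption: it inspects local state and does not query the AWS SDK or Spark, and the function name `describe_credential_source` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def describe_credential_source(
    explicit: Optional[dict] = None,
    env: Optional[Mapping[str, str]] = None,
    shared_file: Optional[Path] = None,
) -> str:
    """Report the first credential source in the list above that looks usable."""
    env = os.environ if env is None else env
    if shared_file is None:
        shared_file = Path.home() / ".aws" / "credentials"
    if explicit:
        return "explicit configuration (JSON file)"
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"
    if shared_file.exists():
        return "AWS CLI config (~/.aws/credentials)"
    # Nothing found locally; only an instance role could still provide credentials.
    return "IAM instance role (if running on EC2)"
```

Calling `describe_credential_source(aws_creds)` before creating the Spark session gives a quick hint when an unexpected set of credentials is being picked up.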