# Apple Lead Data Scientist Python Test

This test evaluates your data science and machine learning skills for a Lead Data Scientist position at Apple. You will work with a synthetic dataset that simulates user interactions on Apple's online store.

---

## Table of Contents

1. [Overview](#overview)
2. [Dataset Generation](#dataset-generation)
3. [Test Instructions](#test-instructions)
   - [Task 1: Data Loading and Preprocessing](#task-1-data-loading-and-preprocessing)
   - [Task 2: Exploratory Data Analysis (EDA)](#task-2-exploratory-data-analysis-eda)
   - [Task 3: Feature Engineering](#task-3-feature-engineering)
   - [Task 4: Model Building](#task-4-model-building)
   - [Task 5: Model Tuning and Evaluation](#task-5-model-tuning-and-evaluation)
   - [Bonus: Deployment and Business Context](#bonus-deployment-and-business-context)
4. [Skeleton Code Example](#skeleton-code-example)

---

## Overview

You are provided with a synthetic dataset, **`apple_data.csv`**, which simulates session data from Apple's online store. The dataset includes the following columns:

- **session_id:** Unique identifier for each session.
- **user_id:** Unique identifier for each user.
- **timestamp:** Timestamp of the session event.
- **device_type:** Type of device used during the session (e.g., "iPhone", "iPad", "Mac", "Apple Watch", "AirPods").
- **action:** The action taken by the user (e.g., "browse", "view", "click", "add_to_cart", "purchase").
- **price:** Price of the item involved in the session (if applicable).
- **product_category:** Category of the product (e.g., "smartphone", "tablet", "laptop", "wearable", "accessory").
- **purchase:** Binary indicator (0 or 1) of whether a purchase occurred during the session.
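
As a rough illustration of the schema above, the following sketch loads session data and checks that the documented columns are present. The helper `load_sessions`, the constant `EXPECTED_COLUMNS`, and the two sample rows are hypothetical and only stand in for the real **`apple_data.csv`**; this is one possible check under those assumptions, not a required part of the test.

```python
import io

import pandas as pd

# Column names as documented in the Overview list.
EXPECTED_COLUMNS = [
    "session_id", "user_id", "timestamp", "device_type",
    "action", "price", "product_category", "purchase",
]

# Hypothetical two-row sample mimicking the documented columns.
# With the real file, pass its path to load_sessions instead.
sample_csv = io.StringIO(
    "session_id,user_id,timestamp,device_type,action,price,product_category,purchase\n"
    "1,103,2023-03-14 09:26:53,iPhone,browse,999.00,smartphone,0\n"
    "2,436,2023-07-02 18:05:10,Mac,purchase,1299.50,laptop,1\n"
)


def load_sessions(source):
    """Read session data and verify the documented schema (illustrative helper)."""
    df = pd.read_csv(source, parse_dates=["timestamp"])
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # The target is documented as a binary indicator.
    if not df["purchase"].isin([0, 1]).all():
        raise ValueError("Column 'purchase' must contain only 0 or 1")
    return df


sessions = load_sessions(sample_csv)
print(sessions.dtypes)
```

Parsing `timestamp` at load time keeps later time-based features (hour of day, day of week) a one-line operation.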

Your objective is to build a predictive model that estimates the likelihood of a purchase from the session data. The work covers data ingestion, cleaning, exploratory analysis, feature engineering, model building, and evaluation. The bonus section assesses your thinking on model deployment and business context.

---

## Dataset Generation

To simulate the scenario, use the following Python script to generate the synthetic dataset and save it as **`apple_data.csv`**:

```python
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Set random seed for reproducibility
np.random.seed(42)

# Define the number of sessions to simulate
num_sessions = 1500

# Generate unique session IDs
session_ids = np.arange(1, num_sessions + 1)

# Generate random user IDs in the range 1-499 (upper bound is exclusive);
# with 1500 sessions this guarantees repeat customers
user_ids = np.random.randint(1, 500, size=num_sessions)

# Generate random timestamps within the year 2023
base_date = datetime(2023, 1, 1)
random_days = np.random.randint(0, 365, size=num_sessions)
random_seconds = np.random.randint(0, 86400, size=num_sessions)
timestamps = [base_date + timedelta(days=int(d), seconds=int(s)) for d, s in zip(random_days, random_seconds)]

# Define device types with weighted probabilities
device_types = np.random.choice(
    ['iPhone', 'iPad', 'Mac', 'Apple Watch', 'AirPods'],
    size=num_sessions,
    p=[0.4, 0.2, 0.25, 0.1, 0.05]
)

# Define possible user actions with weighted probabilities
actions = np.random.choice(
    ['browse', 'view', 'click', 'add_to_cart', 'purchase'],
    size=num_sessions,
    p=[0.4, 0.3, 0.15, 0.1, 0.05]
)

# Generate random prices between $199 and $3000, rounded to 2 decimals
prices = np.round(np.random.uniform(199, 3000, size=num_sessions), 2)

# Determine product category based on device type
category_by_device = {
    'iPhone': 'smartphone',
    'iPad': 'tablet',
    'Mac': 'laptop',
    'Apple Watch': 'wearable',
}
# Any other device type (here only 'AirPods') maps to 'accessory'
product_categories = [category_by_device.get(dt, 'accessory') for dt in device_types]

# Generate purchase indicator:
# - If action is 'purchase': purchase = 1
# - Otherwise, assign purchase based on different probabilities:
#   * 'add_to_cart': 50% chance of purchase
#   * 'click': 20% chance
#   * 'view': 5% chance
#   * 'browse': 1% chance
purchase = []
for act in actions:
    if act == 'purchase':
        purchase.append(1)
    elif act == 'add_to_cart':
        purchase.append(1 if np.random.rand() < 0.5 else 0)
    elif act == 'click':
        purchase.append(1 if np.random.rand() < 0.2 else 0)
    elif act == 'view':
        purchase.append(1 if np.random.rand() < 0.05 else 0)
    else:  # 'browse'
        purchase.append(1 if np.random.rand() < 0.01 else 0)
purchase = np.array(purchase)
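
# Optional sanity check (an illustrative sketch, not part of the original script):
# the overall purchase rate implied by the probabilities above is
#   sum over actions of P(action) * P(purchase | action).
# The names below are hypothetical and chosen not to clash with the script's variables.
action_probs = {'browse': 0.4, 'view': 0.3, 'click': 0.15, 'add_to_cart': 0.1, 'purchase': 0.05}
purchase_prob_given_action = {'browse': 0.01, 'view': 0.05, 'click': 0.2, 'add_to_cart': 0.5, 'purchase': 1.0}
expected_purchase_rate = sum(
    action_probs[a] * purchase_prob_given_action[a] for a in action_probs
)
# Roughly 15% of sessions should be positive, so the target is imbalanced;
# the observed rate (purchase.mean()) should land near this value.
print(f"Expected purchase rate: {expected_purchase_rate:.3f}")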

# Create the DataFrame
df = pd.DataFrame({
    'session_id': session_ids,
    'user_id': user_ids,
    'timestamp': timestamps,
    'device_type': device_types,
    'action': actions,
    'price': prices,
    'product_category': product_categories,
    'purchase': purchase
})

# Save the DataFrame to a CSV file
# (the relative directory ../../data must already exist, or to_csv raises an OSError)
df.to_csv('../../data/apple_data.csv', index=False)
print("Dataset 'apple_data.csv' generated successfully!")
