# Lead Data Scientist Python Test

**Time Allotment:** 90–120 minutes

## Overview

You are provided with a synthetic dataset (`ebay_data.csv`) that simulates user session data on eBay. The dataset contains the following columns:

- **session_id:** Unique identifier for each session.
- **user_id:** Unique identifier for each user.
- **timestamp:** Timestamp of the event.
- **action:** Type of action (e.g., "view", "click", "add_to_cart", "purchase").
- **price:** Price of the item involved in the action.
- **product_category:** Category of the product.
- **purchase:** Binary indicator (0 or 1) showing whether a purchase occurred in the session (target variable).

Your task is to build a predictive model to estimate the likelihood of a purchase (i.e., `purchase` = 1) given the session data. You will perform data loading, cleaning, exploratory analysis, feature engineering, model building, and evaluation. There are also bonus questions to assess deeper insight and strategic thinking.

---

## Task 1: Data Loading and Preprocessing

1. **Load the Data:**  
   - Read the CSV file into a pandas DataFrame.
   - Display the first few rows to understand the data structure.

2. **Data Cleaning:**  
   - Check for and handle missing values and outliers.
   - Convert the `timestamp` column to a datetime object.
   - Create additional time-based features (e.g., hour of day, day of week).
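
The steps above could be sketched as follows. This is a minimal illustration, not a reference solution: the inline `SAMPLE_CSV` rows are made up so the block runs without `ebay_data.csv`, the helper name `load_and_clean` is hypothetical, and median imputation and IQR clipping are assumed choices that may not suit the real data.

```python
import io

import pandas as pd

# Made-up stand-in for ebay_data.csv so the sketch runs on its own.
SAMPLE_CSV = """session_id,user_id,timestamp,action,price,product_category,purchase
s1,u1,2024-01-01 09:15:00,view,19.99,electronics,0
s1,u1,2024-01-01 09:17:00,click,19.99,electronics,0
s2,u2,2024-01-02 21:40:00,add_to_cart,,fashion,1
s2,u2,2024-01-02 21:45:00,purchase,5000.00,fashion,1
s3,u3,not-a-date,view,12.50,home,0
"""


def load_and_clean(source) -> pd.DataFrame:
    """Load session events and apply basic cleaning (one possible approach)."""
    df = pd.read_csv(source)

    # Unparseable timestamps become NaT; those rows are dropped here.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df = df.dropna(subset=["timestamp"]).copy()

    # Assumption: impute missing prices with the median.
    df["price"] = df["price"].fillna(df["price"].median())

    # Assumption: clip price outliers to the 1.5 * IQR fences.
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["price"] = df["price"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # Time-based features.
    df["hour"] = df["timestamp"].dt.hour
    df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
    return df


# With the real file this would be load_and_clean("ebay_data.csv").
df = load_and_clean(io.StringIO(SAMPLE_CSV))
print(df.head())
```

Whether to drop, impute, or flag problem rows depends on how much data is affected, so it is worth reporting the counts before choosing.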

---

## Task 2: Exploratory Data Analysis (EDA)

1. **Summary Statistics:**  
   - Compute descriptive statistics for numerical features.
   - Examine the distribution of the target variable (`purchase`).

2. **Visualization:**  
   - Plot the distribution of key features (e.g., `price`).
   - Visualize the relationship between features and the target variable (e.g., using box plots, histograms, or correlation matrices).
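
One way to approach these EDA steps is sketched below. The DataFrame is randomly generated placeholder data (not the real dataset), and `plot_price_by_purchase` is a hypothetical helper that assumes matplotlib is installed; it is defined but not called here.

```python
import numpy as np
import pandas as pd

# Randomly generated placeholder data with a few of the dataset's columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "price": rng.gamma(shape=2.0, scale=20.0, size=n).round(2),
    "hour": rng.integers(0, 24, size=n),
    "product_category": rng.choice(["electronics", "fashion", "home"], size=n),
    "purchase": rng.integers(0, 2, size=n),
})

# Descriptive statistics for numerical features.
summary = df[["price", "hour"]].describe()

# Class balance of the target, as proportions.
target_share = df["purchase"].value_counts(normalize=True)

# Purchase rate per category, and linear correlations with the target.
rate_by_category = df.groupby("product_category")["purchase"].mean()
corr = df[["price", "hour", "purchase"]].corr()

print(summary, target_share, rate_by_category, corr, sep="\n\n")


def plot_price_by_purchase(frame: pd.DataFrame, path: str = "price_by_purchase.png") -> None:
    """Save a price histogram and a price-by-target box plot to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    frame["price"].plot.hist(bins=30, ax=axes[0], title="Price distribution")
    frame.boxplot(column="price", by="purchase", ax=axes[1])
    fig.savefig(path)
    plt.close(fig)
```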

---

## Task 3: Feature Engineering

1. **Categorical Variables:**  
   - Encode categorical variables (e.g., `action`, `product_category`) using one-hot encoding or another appropriate method.
   
2. **Additional Features:**  
   - Consider creating new features, such as session duration (if a session has multiple timestamps) or user behavior metrics.
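
A possible shape for session-level features is sketched below, on a few made-up events. The helper `build_session_features` and the output column names are hypothetical. The sketch also assumes that events with `action == "purchase"` would leak the target and should be excluded, which is worth verifying against the real data.

```python
import pandas as pd

# Made-up events for three sessions.
events = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s1", "s2", "s2", "s3"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:02:00", "2024-01-01 10:05:00",
        "2024-01-01 10:06:00", "2024-01-01 11:00:00", "2024-01-01 11:00:30",
        "2024-01-01 12:00:00",
    ]),
    "action": ["view", "click", "add_to_cart", "purchase", "view", "view", "click"],
    "price": [20.0, 20.0, 20.0, 20.0, 35.0, 45.0, 12.5],
    "product_category": ["electronics"] * 4 + ["fashion"] * 2 + ["home"],
    "purchase": [1, 1, 1, 1, 0, 0, 0],
})


def build_session_features(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate event rows into one feature row per session."""
    # Assumption: "purchase" events reveal the target, so exclude them.
    ev = events[events["action"] != "purchase"]

    features = ev.groupby("session_id").agg(
        n_events=("action", "size"),
        mean_price=("price", "mean"),
        first_ts=("timestamp", "min"),
        last_ts=("timestamp", "max"),
        top_category=("product_category", lambda s: s.mode().iloc[0]),
        purchase=("purchase", "max"),
    )

    # Session duration in seconds (0 for single-event sessions).
    features["duration_s"] = (features["last_ts"] - features["first_ts"]).dt.total_seconds()
    features = features.drop(columns=["first_ts", "last_ts"])

    # Count of each action type per session, e.g. n_view, n_click.
    action_counts = pd.crosstab(ev["session_id"], ev["action"]).add_prefix("n_")
    features = features.join(action_counts)

    # One-hot encode the most frequent category in the session.
    return pd.get_dummies(features, columns=["top_category"], prefix="cat", dtype=int)


features = build_session_features(events)
print(features)
```

Aggregating to one row per session keeps the unit of prediction aligned with the target, which is defined at the session level.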

---

## Task 4: Model Building

1. **Train-Test Split:**  
   - Split your data into training and testing sets.

2. **Model Selection:**  
   - Build at least one classification model (e.g., Logistic Regression, Random Forest, or XGBoost) to predict purchase behavior.
   - Train your model on the training set.

3. **Evaluation:**  
   - Evaluate your model’s performance on the test set using metrics such as accuracy, precision, recall, F1 score, and ROC AUC.
   - Provide a confusion matrix and classification report.
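
The split, training, and evaluation steps might look like the sketch below. The features and target are synthetic and generated in place, so the printed scores say nothing about the real dataset. The feature names are illustrative assumptions, and Random Forest is just one of the models the task allows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    f1_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic session-level features (illustrative names and values).
rng = np.random.default_rng(42)
n = 600
X = pd.DataFrame({
    "n_events": rng.integers(1, 20, size=n),
    "mean_price": rng.gamma(2.0, 20.0, size=n),
    "n_add_to_cart": rng.integers(0, 4, size=n),
})

# Synthetic target that depends on two of the features.
logit = -2.0 + 0.9 * X["n_add_to_cart"].to_numpy() + 0.05 * X["n_events"].to_numpy()
prob = 1.0 / (1.0 + np.exp(-logit))
y = pd.Series((rng.random(n) < prob).astype(int), name="purchase")

# Stratify so both sets keep a similar share of purchases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of purchase = 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, y_score),
}
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print(metrics)
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```

If the modeling table has several rows per session or per user, a group-aware split (for example scikit-learn's `GroupShuffleSplit`) may be a safer choice than a purely random one.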

---

## Task 5: Model Evaluation and Tuning

1. **Hyperparameter Tuning:**  
   - Use cross-validation and GridSearchCV (or a similar approach) to optimize model hyperparameters.
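
A minimal tuning sketch with `GridSearchCV` follows. It uses data from scikit-learn's `make_classification` rather than the real dataset. The logistic regression pipeline, the parameter grid, and ROC AUC as the selection metric are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary classification data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each CV training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; step-name prefixes route parameters to the classifier.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

# The test set is touched only once, after the search has finished.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("Best parameters:", search.best_params_)
print("Best CV ROC AUC:", round(search.best_score_, 3))
print("Test ROC AUC:", round(test_auc, 3))
```

Keeping preprocessing inside the pipeline avoids leaking statistics from validation folds into training folds during the search.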
