# Data Management for Machine Learning Assignment

## Problem Formulation: Customer Churn Prediction Pipeline

### 1. Business Problem Definition

Customer churn is the loss of existing customers who stop using a company's services or products. It reduces revenue directly and raises acquisition costs, because lost customers have to be replaced. The challenge is to identify in advance which customers are likely to churn, focusing on "addressable churn" where an intervention is still possible, by combining data from multiple sources and deploying a predictive machine learning model. Early identification enables timely retention efforts and limits indirect effects, such as customers moving to competitors or influencing others to leave.

---

### 2. Key Business Objectives

- Proactively identify at-risk customers before they churn.
- Enable targeted retention strategies (e.g., personalized offers, customer support intervention).
- Reduce overall churn rate to minimize revenue loss and acquisition costs.
- Continuously monitor and improve model performance and retention strategy effectiveness.
- Support data-driven decision-making across departments (sales, marketing, product).

---

### 3. Key Data Sources and Attributes

#### Web Logs
- `customer_id`
- `session_id`
- `timestamps`
- `activity_type` (pages viewed, time spent, etc.)

#### Transactional Systems
- `customer_id`
- `transaction_id`
- `transaction_date`
- `purchase_amount`
- `product_category`
- `payment_status`

#### Third-party APIs
- `customer_id`
- `demographic_info` (age, location)
- `sentiment_score`
- `social_media_engagement`

#### Additional Attributes
- Customer service interactions (call frequency, resolution time)
- Engagement metrics (last login, average activity frequency)
- Tenure (length of time as customer)
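
As a rough illustration of how these sources might be combined, the sketch below aggregates web-log and transaction records into one feature row per `customer_id`. The sample records, the `build_features` helper, and the derived feature names (`session_count`, `days_since_last_session`, `total_paid`) are hypothetical; only the input field names come from the lists above, and `timestamps` is assumed to hold one date per log row.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample records; field names follow the attribute lists above.
web_logs = [
    {"customer_id": "c1", "session_id": "s1", "timestamps": date(2024, 3, 1)},
    {"customer_id": "c1", "session_id": "s2", "timestamps": date(2024, 3, 9)},
    {"customer_id": "c2", "session_id": "s3", "timestamps": date(2024, 1, 15)},
]
transactions = [
    {"customer_id": "c1", "transaction_id": "t1",
     "purchase_amount": 40.0, "payment_status": "paid"},
    {"customer_id": "c1", "transaction_id": "t2",
     "purchase_amount": 10.0, "payment_status": "failed"},
    {"customer_id": "c2", "transaction_id": "t3",
     "purchase_amount": 25.0, "payment_status": "paid"},
]


def build_features(web_logs, transactions, as_of):
    """Aggregate raw records into one feature dict per customer_id."""
    features = defaultdict(lambda: {
        "session_count": 0,
        "days_since_last_session": None,
        "total_paid": 0.0,
    })
    for row in web_logs:
        f = features[row["customer_id"]]
        f["session_count"] += 1
        days = (as_of - row["timestamps"]).days
        # Keep the smallest gap, i.e. the most recent session.
        if f["days_since_last_session"] is None or days < f["days_since_last_session"]:
            f["days_since_last_session"] = days
    for row in transactions:
        f = features[row["customer_id"]]
        # Assumption: only successfully paid transactions count as spend.
        if row["payment_status"] == "paid":
            f["total_paid"] += row["purchase_amount"]
    return dict(features)


features = build_features(web_logs, transactions, as_of=date(2024, 3, 31))
```

Keying every source on `customer_id` keeps the join logic simple; a real pipeline would also need to handle customers that appear in only some of the sources.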

---

### 4. Expected Pipeline Outputs

- **Clean, validated datasets** suitable for exploratory data analysis (EDA).
- **Transformed feature sets** engineered specifically for machine learning.
- **Deployable predictive model** capable of identifying churn risk for individual customers.
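
To make "clean, validated" more concrete, here is a minimal sketch of a validation pass over transaction records. The `clean_transactions` helper and its three rules (required fields present, no negative amounts, no duplicate `transaction_id`) are assumptions chosen for illustration, not a prescribed rule set; a business that records refunds as negative amounts would relax the second rule.

```python
REQUIRED_FIELDS = ("customer_id", "transaction_id", "transaction_date", "purchase_amount")


def clean_transactions(rows):
    """Split rows into (clean, rejected); rejected rows carry their problems."""
    clean, rejected, seen_ids = [], [], set()
    for row in rows:
        problems = [
            f"missing {name}"
            for name in REQUIRED_FIELDS
            if row.get(name) in (None, "")
        ]
        amount = row.get("purchase_amount")
        if isinstance(amount, (int, float)) and amount < 0:
            problems.append("negative purchase_amount")
        if row.get("transaction_id") in seen_ids:
            problems.append("duplicate transaction_id")
        if problems:
            rejected.append((row, problems))
        else:
            seen_ids.add(row["transaction_id"])
            clean.append(row)
    return clean, rejected


# Hypothetical sample rows: one valid, one incomplete, one duplicate, one negative.
rows = [
    {"customer_id": "c1", "transaction_id": "t1",
     "transaction_date": "2024-03-01", "purchase_amount": 40.0},
    {"customer_id": "", "transaction_id": "t2",
     "transaction_date": "2024-03-02", "purchase_amount": 15.0},
    {"customer_id": "c1", "transaction_id": "t1",
     "transaction_date": "2024-03-01", "purchase_amount": 40.0},
    {"customer_id": "c2", "transaction_id": "t3",
     "transaction_date": "2024-03-05", "purchase_amount": -5.0},
]
clean, rejected = clean_transactions(rows)
```

Returning the rejected rows together with their reasons, rather than silently dropping them, leaves an audit trail for data quality reporting.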

---

### 5. Measurable Evaluation Metrics

- Accuracy: Share of all customers (churners and non-churners) whose label is predicted correctly; can be misleading when churners are a small minority
- Precision: Ratio of correctly predicted churns to all customers predicted as churn
- Recall: Ratio of correctly predicted churns to all actual churns
- F1 Score: Harmonic mean of precision and recall
- ROC AUC: Probability that the model ranks a random positive example higher than a random negative one
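
The definitions above can be computed directly from labels and scores. The following is a minimal pure-Python sketch; the helper names, the sample labels and scores, and the 0.5 decision threshold are illustrative assumptions. In practice a library such as scikit-learn would typically be used instead.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores for eight customers.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.3, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise `roc_auc` mirrors the ranking definition in the list and is quadratic in the number of examples, so it is suited to illustration rather than large datasets.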

**Optional:**
- Churn rate before vs. after intervention
- Lift in retention for at-risk segments
- Lead time for churn prediction (how early the model flags risk)
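
One possible way to quantify these optional metrics is sketched below. The function names and all figures are hypothetical, and lift is assumed here to mean the ratio of the retention rate in a treated at-risk group to that of a comparable untreated control group; other definitions (for example, an absolute difference) are equally reasonable.

```python
from datetime import date


def churn_rate(churned, total):
    """Fraction of customers who churned in a period."""
    return churned / total


def retention_lift(treated_retained, treated_total, control_retained, control_total):
    """Ratio of treated retention rate to control retention rate (assumed definition)."""
    treated_rate = treated_retained / treated_total
    control_rate = control_retained / control_total
    return treated_rate / control_rate


def lead_time_days(flag_date, churn_date):
    """Days between the model first flagging a customer and the actual churn."""
    return (churn_date - flag_date).days


# Hypothetical figures for illustration only.
rate_before = churn_rate(churned=120, total=1000)
rate_after = churn_rate(churned=90, total=1000)
lift = retention_lift(treated_retained=80, treated_total=100,
                      control_retained=64, control_total=100)
lead = lead_time_days(date(2024, 5, 1), date(2024, 6, 15))
```

Comparing treated customers against a control group, rather than only before against after, helps separate the effect of the intervention from seasonal or market-wide changes in churn.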

---
