# Work Plan: Telecom Churn Prediction

## Data sources:
- contract.csv — contract info (customerID, Contract, MonthlyCharges, TotalCharges, PaymentMethod, EndDate, etc.)
- personal.csv — personal customer data (customerID, gender, SeniorCitizen, Partner, Dependents, etc.)
- internet.csv — internet services (customerID, OnlineSecurity, OnlineBackup, DeviceProtection, StreamingTV, StreamingMovies, InternetService)
- phone.csv — phone services (customerID, MultipleLines, PhoneService)
- merged_raw_snapshot.csv (merged data before cleaning, kept for traceability)
- merged_clean.csv (merged and cleaned data)
- Merge strategy: left join on customerID, with contract.csv as the base table, to create the master table (one row per unique customer).
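
The merge strategy can be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual code; the tiny DataFrames stand in for the four CSV files, and `validate="one_to_one"` is an optional safeguard assumed here rather than something the notes state was used.

```python
import pandas as pd

# Tiny stand-ins for the four source files (values are made up for illustration).
contract = pd.DataFrame({"customerID": ["A", "B", "C"], "MonthlyCharges": [29.85, 56.95, 53.85]})
personal = pd.DataFrame({"customerID": ["A", "B", "C"], "SeniorCitizen": [0, 0, 1]})
internet = pd.DataFrame({"customerID": ["A", "C"], "InternetService": ["DSL", "Fiber optic"]})
phone = pd.DataFrame({"customerID": ["B", "C"], "MultipleLines": ["No", "Yes"]})

master = contract
for table in (personal, internet, phone):
    # validate="one_to_one" raises a MergeError if customerID is duplicated on either side.
    master = master.merge(table, on="customerID", how="left", validate="one_to_one")

# Customers absent from the internet/phone tables appear as NaN rather than being dropped.
coverage = master[["InternetService", "MultipleLines"]].isna().mean()
print(master)
print(coverage)
```

A left join keeps every contract row, so the missing share per column doubles as a coverage check for the optional service tables.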

## Notes about the contract database:

The goal of the Telecom Churn Project is to develop a model that forecasts whether a client is planning to leave. Interconnect wants to identify clients who are likely to churn so it can offer them special deals in the hope of getting them to stay.

**What has been done in the Preprocessing stage:**

- Normalized column names (stripped whitespace).
- Read CSVs safely (low_memory=False) and printed shapes/heads.
- Checked for duplicate customerID rows (none present in raw files).
- Parsed dates:
    - BeginDate → datetime.
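    - Kept the raw EndDate string and created EndDate_parsed (parsed date, or NaT when EndDate is 'No'). Both are used only to build the labels and are excluded from the model features to avoid target leakage.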
- Computed snapshot-based tenure: tenure_months_snapshot from BeginDate → 2020-02-01.
- Coerced numeric columns to numeric:
    - MonthlyCharges and TotalCharges → numeric (errors='coerce').
- Deterministic imputation for TotalCharges:
    - 11 missing originally; filled with MonthlyCharges * tenure_months_snapshot (then fallback to MonthlyCharges and median) → no remaining TotalCharges NA.
- Merged the four tables on customerID (left-join on contract), saved merged snapshot CSV.
- Checked merge coverage / missingness:
    - InternetService missing = 1,526 (21.67%) — likely means "no internet service".
    - MultipleLines missing = 682 (9.68%) — likely means "no phone service".
- Converted common Yes/No flags into binary columns (Yes→1, No→0), normalized small categorical columns (strip).
- Created a small set of deterministic derived features:
    - num_internet_services (sum of internet-related flags)
    - has_internet (InternetService != 'No')
    - has_phone (from MultipleLines)
    - num_services = internet services + phone
    - avg_monthly_from_total = TotalCharges / tenure_months_snapshot (fallback to MonthlyCharges)
    - payment_auto flag (PaymentMethod contains 'automatic')
- Created two explicit label columns (kept both for transparency):
    - churn_left = EndDate_parsed.notna() (left before snapshot)
    - active_at_snapshot = (EndDate_raw == 'No') (active at snapshot). Per the Clarification Summary, I believe this is the default assignment target.
    - Set chosen_target = active_at_snapshot.
- Saved artifacts:
    - data/processed/merged_clean.csv (cleaned table)
    - models/preprocessor_template.joblib (an unfitted ColumnTransformer template)
    - saved a merged_raw_snapshot for traceability
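
A few of the steps above (date parsing, snapshot tenure, numeric coercion, TotalCharges imputation, and the two labels) can be sketched like this. It is a simplified illustration on three made-up rows, not the actual preprocessing code. Assumptions: the raw string column is called `EndDate` here (the notes refer to it as EndDate_raw), tenure is counted in whole calendar months, and the fallback chain is applied literally in the order listed.

```python
import pandas as pd

SNAPSHOT = pd.Timestamp("2020-02-01")

# Made-up rows; column names follow contract.csv as described above.
df = pd.DataFrame({
    "customerID": ["A", "B", "C"],
    "BeginDate": ["2019-02-01", "2020-02-01", "2018-02-01"],
    "EndDate": ["No", "No", "2019-12-01 00:00:00"],
    "MonthlyCharges": ["50.0", "20.0", "70.0"],
    "TotalCharges": ["600.0", " ", "1540.0"],
})

df["BeginDate"] = pd.to_datetime(df["BeginDate"])
# Keep the raw EndDate string; 'No' is masked first so it becomes NaT in the parsed copy.
df["EndDate_parsed"] = pd.to_datetime(df["EndDate"].where(df["EndDate"] != "No"), errors="coerce")

# Whole calendar months between BeginDate and the snapshot date (assumed definition).
df["tenure_months_snapshot"] = (
    (SNAPSHOT.year - df["BeginDate"].dt.year) * 12
    + (SNAPSHOT.month - df["BeginDate"].dt.month)
)

# Blank strings such as " " become NaN instead of raising.
for col in ("MonthlyCharges", "TotalCharges"):
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Deterministic imputation chain: estimate, then MonthlyCharges, then the median.
estimate = df["MonthlyCharges"] * df["tenure_months_snapshot"]
df["TotalCharges"] = (
    df["TotalCharges"]
    .fillna(estimate)
    .fillna(df["MonthlyCharges"])
    .fillna(df["MonthlyCharges"].median())
)

# Two explicit labels, both derived from EndDate only.
df["churn_left"] = df["EndDate_parsed"].notna().astype(int)
df["active_at_snapshot"] = (df["EndDate"] == "No").astype(int)

# Average monthly spend; zero tenure is masked so the division falls back to MonthlyCharges.
tenure = df["tenure_months_snapshot"].where(df["tenure_months_snapshot"] > 0)
df["avg_monthly_from_total"] = (df["TotalCharges"] / tenure).fillna(df["MonthlyCharges"])

print(df[["customerID", "tenure_months_snapshot", "TotalCharges",
          "churn_left", "active_at_snapshot", "avg_monthly_from_total"]])
```

Note that with this literal chain a customer who joined on the snapshot date gets an estimate of 0 (MonthlyCharges × 0 months), so the later fallbacks only fire when the estimate itself is missing.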

**Tasks to be done in the EDA stage:**
- Visualizations
- Checking distributions, skewness, seasonality, and outliers in detail.
- Identifying which engineered features help and which are redundant
- Deciding on encoding strategies after seeing cardinalities and rare categories (target encoding vs one-hot).
- Choosing class imbalance technique after seeing model baselines.
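
As a starting point for the distribution, outlier, and cardinality checks, something like the following could be used. It is a rough sketch on made-up values; the 1.5 × IQR rule is one common outlier heuristic assumed here, not a method fixed by this plan.

```python
import pandas as pd

# Made-up sample; column names follow the merged table described above.
df = pd.DataFrame({
    "MonthlyCharges": [20.0, 25.0, 30.0, 35.0, 110.0],
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year", "Month-to-month"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check",
                      "Bank transfer (automatic)", "Credit card (automatic)"],
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Skewness per numeric column (positive means a long right tail).
skew = df[numeric_cols].skew()

# Outlier count per numeric column using the 1.5 * IQR rule (assumed heuristic).
q1 = df[numeric_cols].quantile(0.25)
q3 = df[numeric_cols].quantile(0.75)
iqr = q3 - q1
is_outlier = (df[numeric_cols] < q1 - 1.5 * iqr) | (df[numeric_cols] > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()

# Cardinality and category shares, to spot rare levels before choosing an encoding.
cardinality = df[categorical_cols].nunique()
contract_shares = df["Contract"].value_counts(normalize=True)

print(skew, outlier_counts, cardinality, contract_shares, sep="\n\n")
```

Low-cardinality columns are natural one-hot candidates, while columns with many rare levels are where grouping or target encoding becomes worth testing.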

**Tasks in Feature Engineering stage:**
- Finalize service flags, tenure bins, payment/contract one-hots, and missingness indicators.
- Add ratios and avg_monthly_from_total variants, run quick RF probe and measure AUC gain.
- Test a small set of interactions and categorical groupings, then run feature selection/SHAP to prune.
- If dimensionality grows, apply reduction or target encoding carefully and re-evaluate.
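
The tenure bins, one-hot columns, and missingness indicators mentioned above might look roughly like this. The bin edges and labels are hypothetical placeholders to be revisited after EDA, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN in InternetService mimics customers absent from internet.csv.
df = pd.DataFrame({
    "tenure_months_snapshot": [0, 5, 18, 40, 70],
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year", "Two year"],
    "InternetService": ["DSL", np.nan, "Fiber optic", np.nan, "DSL"],
})

# Missingness indicator, created before any filling so the signal is not lost.
df["InternetService_missing"] = df["InternetService"].isna().astype(int)

# Hypothetical tenure bins in months; intervals are right-closed, e.g. (12, 24].
bins = [-1, 12, 24, 48, np.inf]
labels = ["0-12", "13-24", "25-48", "49+"]
df["tenure_bin"] = pd.cut(df["tenure_months_snapshot"], bins=bins, labels=labels)

# One-hot encode the contract type and the tenure bin.
features = pd.get_dummies(df, columns=["Contract", "tenure_bin"], dtype=int)
print(features.columns.tolist())
```

Starting the first bin at -1 keeps zero-tenure customers inside the "0-12" bucket instead of turning them into NaN.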

Use stratified CV optimizing AUC-ROC: start from a baseline model, add feature blocks one at a time (e.g., contract/payment, service counts, interactions), and measure the change in AUC. Use permutation importance / SHAP and simple ablation (removing features or feature groups) to validate usefulness and avoid overfitting. Keep features that consistently improve mean CV AUC and are stable across folds.
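
The block-wise comparison described above could be prototyped as follows. This is a sketch on synthetic data from `make_classification`, with logistic regression as a stand-in model; the block names and column groupings are hypothetical, and the real run would use the engineered churn features instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data. With shuffle=False the first four
# columns are the informative ones and the last four are pure noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    weights=[0.75, 0.25], shuffle=False, random_state=0,
)

base_cols = [0, 1]
blocks = {                      # hypothetical feature blocks
    "block_informative": [2, 3],
    "block_noise": [4, 5, 6, 7],
}

# Fixed folds so every feature set is scored on identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


def cv_auc(cols):
    """Return the per-fold ROC AUC for the given column subset."""
    return cross_val_score(model, X[:, cols], y, cv=cv, scoring="roc_auc")


baseline = cv_auc(base_cols)
results = {}
for name, cols in blocks.items():
    scores = cv_auc(base_cols + cols)
    results[name] = {
        "delta_auc": scores.mean() - baseline.mean(),        # mean gain over baseline
        "folds_improved": int((scores > baseline).sum()),    # stability across folds
    }

print(f"baseline mean AUC: {baseline.mean():.3f}")
for name, stats in results.items():
    print(name, stats)
```

Reporting both the mean gain and the number of folds that improved makes it easier to tell a consistent improvement from one driven by a single lucky split.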

## Proposed Work Plan:

1. Download the data
2. Explore the data to determine how to treat it in the preprocessing stage
3. Perform preprocessing on the data
4. Perform Exploratory Data Analysis to explore the data in depth
5. Perform feature engineering, then develop a model and prepare a report

## Clarifying Questions:
1. First and foremost, I feel like I'm making this more complicated and difficult than it needs to be. Am I overthinking this and doing too much? Before turning in this first part, I redid the preprocessing code three times because I thought I wasn't doing enough, which may also have introduced problems.
2. Should missing InternetService/MultipleLines be treated as explicit 'No service' or kept as NaN (with a missingness indicator)?
3. Is there any preference for interpretable models (logistic regression / shallow trees) vs. black‑box ensembles? Do you expect a SHAP analysis or feature importance for the final submission?
4. The Clarification Summary says Target = (EndDate == 'No'). Should I treat EndDate == 'No' (active) as the positive class, or do you want churn (EndDate != 'No') modeled as the positive class?
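
To make the positive-class question concrete, the small sketch below (made-up labels and scores, with an assumed 0.5 threshold) shows that ROC AUC is unchanged when both the labels and the predicted probabilities are flipped, while class-specific metrics such as recall depend on which class is treated as positive.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Made-up example: 1 = active at snapshot, 0 = churned.
y_active = np.array([1, 1, 0, 1, 0, 1])
p_active = np.array([0.9, 0.8, 0.3, 0.6, 0.7, 0.4])  # predicted P(active)

# Same problem expressed with churn as the positive class.
y_churn = 1 - y_active
p_churn = 1 - p_active

auc_active = roc_auc_score(y_active, p_active)
auc_churn = roc_auc_score(y_churn, p_churn)

# Recall at an assumed 0.5 threshold, once per choice of positive class.
pred_active = (p_active >= 0.5).astype(int)
recall_active = recall_score(y_active, pred_active, pos_label=1)
recall_churn = recall_score(y_active, pred_active, pos_label=0)

print(auc_active, auc_churn)
print(recall_active, recall_churn)
```

So for an AUC-ROC target the choice mainly affects how the report is worded, but it matters as soon as precision, recall, or F1 are quoted.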