# Final Report: Interconnect Churn Prediction Project

## 1. Steps performed and skipped

### Performed Steps:

* **Data Loading & Initial Exploration:**
    * Loaded the dataset and conducted an initial review to understand the structure, data types, and presence of null values.
  
* **Data Cleaning:**
    * Handled missing values using domain-appropriate strategies (e.g., filling NAs in service-related columns with "No").
      
    * Converted categorical variables such as Yes/No to binary format for model compatibility.
 
      
* **EDA (Exploratory Data Analysis):**

    * Visualized churn distribution, analyzed feature correlations, and understood the behavior of categorical and numerical features.
 
* **Feature Engineering:**

    * One-hot encoded nominal categorical features.

    * Created a binary churn column as the target variable.
 
* **Handling Class Imbalance with SMOTE:**
    * Applied SMOTE to the training set only, after the train/test split, so that no synthetic samples leaked into the test set.
    * Oversampled the minority (churn) class so that the models trained on balanced classes.
 
  
* **Train/Test Split:**

    * Split the dataset into training and testing subsets with stratification on the target variable.
    
  
* **Model Training:**

    * Trained 6 models:
        * Logistic Regression
        * Random Forest
        * XGBoost
        * Decision Tree
        * CatBoost
        * LightGBM
        

* **Model Evaluation:**

    * Used ROC AUC, precision, recall, and F1-score, and visualized confusion matrices and ROC curves. Accuracy was not used for model selection because it is misleading under class imbalance.
  
* **Hyperparameter Tuning:**

    * Used `RandomizedSearchCV` to optimize model performance.
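
The cleaning and encoding steps listed above can be sketched roughly as follows. This is a minimal illustration using pandas, not the project's actual code: the column names (`OnlineSecurity`, `PaperlessBilling`, `Type`) and the toy rows are assumptions chosen only to show the pattern.

```python
import pandas as pd

# Toy frame; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "OnlineSecurity": ["Yes", None, "No", None],
    "PaperlessBilling": ["Yes", "No", "Yes", "No"],
    "Type": ["Month-to-month", "One year", "Two year", "Month-to-month"],
})

# A missing service flag is treated as "service not subscribed".
service_cols = ["OnlineSecurity"]
df[service_cols] = df[service_cols].fillna("No")

# Map Yes/No columns to 1/0 so models can consume them directly.
binary_cols = ["OnlineSecurity", "PaperlessBilling"]
for col in binary_cols:
    df[col] = df[col].map({"Yes": 1, "No": 0})

# One-hot encode the nominal column; drop_first removes one redundant dummy.
encoded = pd.get_dummies(df, columns=["Type"], drop_first=True, dtype=int)
print(encoded)
```

Filling before mapping matters here: mapping first would turn the missing values into `NaN` rather than `0`.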
 
### Skipped Steps:

* **Deep Learning Models:**
    * Skipped because the dataset is relatively small; tree-based and linear models were more suitable.
  
* **Complex Feature Engineering:**

    * Advanced techniques such as NLP were unnecessary because most variables were structured and categorical.
  
* **Production Deployment:**

    * Not required for this project scope; focus was on model development and evaluation.

## 2. Difficulties Encountered & Solutions

* **Imbalanced Target Classes:**

    * Initially, churners were a small portion of the dataset, which hurt recall.
    * Solution: Applied SMOTE on the training set to synthetically generate churn samples, which boosted model recall and balanced learning.
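
To make the oversampling idea concrete, here is a simplified sketch of the interpolation at the heart of SMOTE: pick a minority sample, pick one of its nearest minority neighbors, and place a synthetic point somewhere on the line between them. The function `smote_like_oversample` and the toy arrays are hypothetical and written only for illustration; in practice a library implementation such as imbalanced-learn's `SMOTE` would normally be used, and this is not the project's code.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_minority), size=n_new)        # random minority rows
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]   # one neighbor per row
    gap = rng.random((n_new, 1))                               # weight in [0, 1)
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Toy TRAINING data only (assumed values): 8 majority rows, 4 minority rows.
X_major = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
                    [0.4, 0.1], [0.2, 0.4], [0.5, 0.3], [0.3, 0.5]])
X_minor = np.array([[2.0, 2.0], [2.2, 1.9], [1.9, 2.3], [2.4, 2.2]])

# Generate just enough synthetic minority rows to balance the classes.
synthetic = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, synthetic])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(synthetic)))
print(np.bincount(y_balanced))
```

The test set is never passed through this step, so evaluation still reflects the original class proportions.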

## 3. Key Steps to Solving the Task

* **EDA:**
    * Understanding churn drivers helped prioritize which features might be most predictive (e.g., contract type, monthly charges).
  
* **Feature Selection:**

    * Dropping low-variance and redundant features improved model performance.
  
* **Hyperparameter Tuning:**

    * Helped fine-tune the Random Forest and XGBoost models for higher ROC AUC and better generalization, although some models scored worse after tuning.

* **Model Comparison:**

    * Testing multiple models revealed the best tradeoff between interpretability and performance.
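
As a rough illustration of the tuning step, the sketch below runs scikit-learn's `RandomizedSearchCV` over a Random Forest with ROC AUC as the scoring metric. The synthetic dataset from `make_classification`, the parameter ranges, and the small `n_iter` are all assumptions made to keep the example fast; they are not the search space used in the project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the churn data (assumption).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.75, 0.25], random_state=0)

# Stratify so both splits keep the same churn ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Illustrative search space; each draw samples one value per parameter.
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random configurations to try
    scoring="roc_auc",   # same headline metric as the report
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```

Because only a handful of random configurations are sampled, a tuned model can land below its default configuration, so comparing `best_score_` against an untuned baseline is a sensible check.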

## 4. Final Model & Performance

* **Selected Model:** LightGBM Classifier

* **Parameters Used:** Tuned via `RandomizedSearchCV` on the validation set.

* **Key Features:** Contract Type, Begin Date (Tenure), Monthly Charges, Payment Method

* **Threshold Used:** 0.1, lowered from the conventional 0.5 cutoff to increase recall


* **Test Set Performance:**

    * Precision: 0.41
    * Recall: 0.94
    * F1-Score: 0.57
    * ROC AUC: 0.83
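
The effect of the decision threshold on these metrics can be illustrated with a small pure-Python sketch. The labels and predicted probabilities below are made up for illustration and are not project data; the helper names `classify` and `precision_recall_f1` are likewise hypothetical. The final lines check that the reported F1-score is consistent with the reported precision and recall.

```python
def classify(probabilities, threshold):
    """Label a customer as churn (1) when the predicted probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (churn) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels and predicted churn probabilities (assumption).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.80, 0.35, 0.12, 0.40, 0.30, 0.15, 0.05, 0.02]

# Lowering the threshold flags more customers: recall rises, precision falls.
for threshold in (0.5, 0.1):
    precision, recall, f1 = precision_recall_f1(y_true, classify(probs, threshold))
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Consistency check on the reported test metrics: F1 = 2PR / (P + R).
reported_f1 = 2 * 0.41 * 0.94 / (0.41 + 0.94)
print(round(reported_f1, 2))
```

On this toy data, moving the threshold from 0.5 to 0.1 catches every churner at the cost of flagging several non-churners, which mirrors the high-recall, lower-precision profile reported above.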

## 5. Conclusion

The final LightGBM model demonstrates strong recall (0.94) and solid discriminative ability (ROC AUC = 0.83), which suits a churn detection use case where missing an at-risk customer is costlier than a false alarm.

Precision is lower (0.41), as expected with the aggressive 0.1 threshold, but this tradeoff is acceptable when the business goal is to retain as many at-risk customers as possible.

SMOTE and threshold adjustment were pivotal to achieving these results.