<p style="text-align:center"> 
    <a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/" target="_blank"> 
    <img src="../assets/logo.png" width="200" alt="Flavio Aguirre Logo"> 
    </a>
</p>

<h1 align="center"><font size="7"><strong>📉 ByeBye Predictor</strong></font></h1>
<br>
<hr>

## TELCO Feature Engineering

Now that we have baselines and know how they perform on the raw data alone, we'll try to improve them by enriching the dataset with new features. The extra information should help the models find better patterns and relationships in the data.

``Objective:`` Improve the quality of the Telco dataset by transforming and creating new variables to optimize the performance of churn prediction models.

In [63]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, recall_score

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

<br>

<hr>

### Load Dataset

In [71]:
df = pd.read_csv('../data/processed/telco-customer-churn-processed.csv')
df.head()

| | tenure | monthlycharges | totalcharges | gender_male | seniorcitizen_yes | partner_yes | dependents_yes | phoneservice_yes | multiplelines_no_phone_service | multiplelines_yes | ... | streamingtv_yes | streamingmovies_no_internet_service | streamingmovies_yes | contract_one_year | contract_two_year | paperlessbilling_yes | paymentmethod_credit_card_automatic | paymentmethod_electronic_check | paymentmethod_mailed_check | churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -1.280248 | -1.161694 | -0.994194 | False | False | True | False | False | True | False | ... | False | False | False | False | False | True | False | True | False | 0 |
| 1 | 0.064303 | -0.260878 | -0.17374 | True | False | False | False | True | False | False | ... | False | False | False | True | False | False | False | False | True | 0 |
| 2 | -1.239504 | -0.363923 | -0.959649 | True | False | False | False | True | False | False | ... | False | False | False | False | False | True | False | False | True | 1 |
| 3 | 0.512486 | -0.74785 | -0.195248 | True | False | False | False | False | True | False | ... | False | False | False | True | False | False | False | False | False | 0 |
| 4 | -1.239504 | 0.196178 | -0.940457 | False | False | False | False | True | False | False | ... | False | False | False | False | False | True | False | True | False | 1 |


### Creating new features with business sense

These new variables should capture non-obvious patterns, such as those grounded in domain experience and customer behavior.

Therefore, the following new features are created:

* ``Average monthly spending``:

**Why?** 

Customers who pay more monthly may be at greater risk of churn if they don't perceive value.

In [72]:
# avg_monthly_spend:
df["avg_monthly_spend"] = df["totalcharges"] / (df["tenure"] + 1)
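The file loaded above is already standardized (the `tenure` values in `df.head()` are z-scores such as -1.28), so `tenure + 1` can land close to zero and make this ratio unstable. The following is a minimal sketch, assuming the unscaled `tenure` and `totalcharges` were still available. The `raw` DataFrame and its values are hypothetical and only illustrate computing the ratio before scaling.

```python
import pandas as pd

# Hypothetical raw (unscaled) values; the column names mirror the notebook's,
# but the numbers are made up for illustration.
raw = pd.DataFrame({
    "tenure": [1, 12, 34, 0],                      # months with the company
    "totalcharges": [29.85, 680.40, 1889.50, 0.0],
})

# On the raw scale tenure >= 0, so tenure + 1 >= 1 and the ratio stays bounded.
raw["avg_monthly_spend"] = raw["totalcharges"] / (raw["tenure"] + 1)

# Standardizing first centers tenure on 0, so tenure + 1 can approach zero.
z_tenure = (raw["tenure"] - raw["tenure"].mean()) / raw["tenure"].std()

print(raw["avg_monthly_spend"].round(2).tolist())
print((z_tenure + 1).round(3).tolist())
```

If the ratio is built this way, it could then be scaled together with the other numeric columns.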

<br>

* ``New Customer`` (less than 6 months):

**Why?** 

New customers tend to have a higher churn rate (they're not yet loyal).

In [73]:
# is_new_customer:
df["is_new_customer"] = df["tenure"] < 6

<br>

* ``Loyal Customer`` (more than 60 months):

**Why?** 

Long-term customers may have different patterns and a lower risk of churn.

In [74]:
# is_loyal_customer:
df["is_loyal_customer"] = df["tenure"] > 60
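Because `tenure` in this file is standardized, comparing it directly with 6 or 60 months cannot separate customers: z-scores rarely leave the range of roughly -3 to 3. One possible workaround is to express the month thresholds on the standardized scale. The sketch below assumes the mean and standard deviation of the raw tenure are known. `raw_tenure` and its values are hypothetical, and the population standard deviation (`ddof=0`) is an assumption about how the column was scaled.

```python
import pandas as pd

# Hypothetical raw tenure in months (illustrative values, not the real data).
raw_tenure = pd.Series([1, 3, 8, 24, 45, 61, 72], name="tenure")

# Standardize with the population std (ddof=0); this is an assumption about
# the scaler used upstream.
mu, sigma = raw_tenure.mean(), raw_tenure.std(ddof=0)
z_tenure = (raw_tenure - mu) / sigma

# Comparing z-scores with month thresholds yields constant columns.
naive_new = z_tenure < 6      # True for every row
naive_loyal = z_tenure > 60   # False for every row

# Express the month thresholds on the standardized scale instead.
is_new_customer = z_tenure < (6 - mu) / sigma
is_loyal_customer = z_tenure > (60 - mu) / sigma

print(is_new_customer.tolist())
print(is_loyal_customer.tolist())
```

An alternative with the same effect would be to create both flags from the raw tenure before any scaling is applied.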

<br>

* ``High-spending customer`` (monthly charges > average):

**Why?**

High-spending customers tend to have higher expectations and be more sensitive to issues or prices.

In [75]:
# high_spender
df["high_spender"] = df["monthlycharges"] > df["monthlycharges"].mean()

<br>

* ``Fiber optics + long-term contract``:

**Why?** 

This combination may indicate loyalty (premium service + long-term contract).

In [76]:
# has_fiber_contract:
df["has_fiber_contract"] = df["internetservice_fiber_optic"] & (
    df["contract_two_year"] | df["contract_one_year"]
)

<br>

* ``Total number of contracted services``:

**Why?** 

The more contracted services, the more expensive it is for the customer to switch providers.

In [77]:
# num_services
service_cols = [
    "phoneservice_yes", "multiplelines_yes", "onlinebackup_yes", "onlinesecurity_yes",
    "deviceprotection_yes", "techsupport_yes", "streamingtv_yes", "streamingmovies_yes"
]

df["num_services"] = df[service_cols].sum(axis=1)
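The count works because pandas treats booleans as 0/1 when summing. As a small illustration on made-up rows, using only three of the service columns listed above:

```python
import pandas as pd

# Toy rows using a subset of the notebook's one-hot service columns.
toy = pd.DataFrame({
    "phoneservice_yes": [True, True, False],
    "streamingtv_yes": [True, False, False],
    "techsupport_yes": [True, False, False],
})

toy_service_cols = ["phoneservice_yes", "streamingtv_yes", "techsupport_yes"]

# Booleans are summed as 0/1, so the row-wise sum counts active services.
toy["num_services"] = toy[toy_service_cols].sum(axis=1)

print(toy["num_services"].tolist())
```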

<br>

### Checking the new variables

In [78]:
df[[
    "avg_monthly_spend", "is_new_customer", "is_loyal_customer",
    "high_spender", "has_fiber_contract", "num_services"
]].describe()

| | avg_monthly_spend | num_services |
| --- | --- | --- |
| count | 7032.0 | 7032.0 |
| mean | -2.01154 | 3.363339 |
| std | 22.031629 | 2.062067 |
| min | -191.990773 | 0.0 |
| 25% | -0.450478 | 1.0 |
| 50% | 0.34008 | 3.0 |
| 75% | 0.941223 | 5.0 |
| max | 26.95756 | 8.0 |


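As we can see, the summary only shows two columns.

***Why?***

By default, `describe()` summarizes only the numeric columns of a mixed-type DataFrame, so the four boolean columns are left out. They work perfectly well once treated as numeric (0/1).

Therefore, let's cast the columns to integers for a moment to see the full statistical summary of the new variables. Note that this cast also truncates `avg_monthly_spend` to whole numbers, which is why its quartiles appear as 0 in the output below.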

In [79]:
df[[
    "avg_monthly_spend", "num_services", 
    "is_new_customer", "is_loyal_customer", 
    "high_spender", "has_fiber_contract"
]].astype(int).describe()

| | avg_monthly_spend | num_services | is_new_customer | is_loyal_customer | high_spender | has_fiber_contract |
| --- | --- | --- | --- | --- | --- | --- |
| count | 7032.0 | 7032.0 | 7032.0 | 7032.0 | 7032.0 | 7032.0 |
| mean | -2.166524 | 3.363339 | 1.0 | 0.0 | 0.557594 | 0.137656 |
| std | 21.903338 | 2.062067 | 0.0 | 0.0 | 0.496707 | 0.344564 |
| min | -191.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 75% | 0.0 | 5.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| max | 26.0 | 8.0 | 1.0 | 0.0 | 1.0 | 1.0 |


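This statistical summary lets us quickly check whether the new variables are calculated correctly and whether they can add predictive power to the model.

Here it also exposes a problem. `is_new_customer` is 1 for every row and `is_loyal_customer` is 0 for every row (both have a standard deviation of 0), because `tenure` in this file is standardized and its values never reach the thresholds of 6 or 60 months. As constants, these two columns give the model nothing to split on. The large negative values of `avg_monthly_spend` (minimum around -191) have the same origin: the ratio is computed from standardized values, where the denominator `tenure + 1` can be close to zero.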
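Casting every column with `.astype(int)` also truncates the float feature. A minimal sketch of one alternative, casting only the boolean columns so that the floats keep their decimals, is shown below on a toy frame. The values are illustrative and `select_dtypes` is just one way to pick out the boolean columns.

```python
import pandas as pd

# Toy frame mixing a float feature with boolean flags (illustrative values).
toy = pd.DataFrame({
    "avg_monthly_spend": [-0.45, 0.34, 0.94, 26.96],
    "num_services": [0, 3, 5, 8],
    "high_spender": [False, True, True, True],
    "has_fiber_contract": [False, False, True, False],
})

# Cast only the boolean columns; the float column is left untouched.
bool_cols = toy.select_dtypes(include="bool").columns
summary = toy.astype({col: "int64" for col in bool_cols}).describe()

print(summary)
```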

<br>

### Final Summary

Now with our new variables associated with our dataframe:

| New Variable | Type | Business Sense |
| -------------------- | -------- | ------------------------------------------------------------------- |
| `avg_monthly_spend` | Numeric | Average monthly spend, possible indicator of dissatisfaction if it is high |
| `is_new_customer` | Boolean | New customers tend to leave more quickly |
| `is_loyal_customer` | Boolean | Loyal customers may have a low probability of churn |
| `high_spender` | Boolean | Higher-paying customers may be more demanding |
| `has_fiber_contract` | Boolean | Premium customers with long-term contracts tend to be more stable |
| `num_services` | Numeric | The more services a customer has, the lower the probability of churn |

Let's run a quick test to see how these new features affect the model.

<br>

### Rapid Validation with Random Forest

**Why do we only use Random Forest at this point?**

A robust model like Random Forest is enough to validate the value of the new features because:
* It captures nonlinear relationships
* It gives us a solid idea of whether the new variables improve churn recall
* We don't need to spend resources training the five models just for pre-validation.

<br>

Our goal is to improve recall, because it tells us how many of the customers who are about to abandon the service (churn = 1) we actually detect. Recall (or sensitivity) measures the model's ability to correctly identify positive cases.

This translates to:
``Detecting those who leave, before they leave.``

In [80]:
# Define features and targets
X = df.drop(columns="churn")
y = df["churn"]

# Split the data with a test_size of 0.3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fast model with Random Forest
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluation
print("Classification report:\n")
print(classification_report(y_test, y_pred, digits=4))

recall_churn = recall_score(y_test, y_pred)
print(f"\nRecall (churning customers): {recall_churn:.4f}")

Classification report:

              precision    recall  f1-score   support

           0     0.8244    0.8974    0.8594      1549
           1     0.6250    0.4724    0.5381       561

    accuracy                         0.7844      2110
   macro avg     0.7247    0.6849    0.6987      2110
weighted avg     0.7714    0.7844    0.7739      2110


Recall (churning customers): 0.4724
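To make the recall figure concrete, recall for the churn class is TP / (TP + FN), and TP + FN equals the support of that class. The short sketch below works backwards from the rounded numbers in the report, so the counts are inferred rather than read from a confusion matrix.

```python
# Values taken from the classification report above.
support_churn = 561      # actual churners in the test set
recall_churn = 0.4724    # reported recall for class 1

# recall = TP / (TP + FN), and TP + FN equals the class support.
true_positives = round(recall_churn * support_churn)
false_negatives = support_churn - true_positives

print(f"Churners detected: {true_positives}")
print(f"Churners missed:   {false_negatives}")
print(f"Recall check:      {true_positives / support_churn:.4f}")
```

In other words, the model catches a little under half of the customers who actually leave.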


### Comparison: before vs. after

Now we compare the baseline metrics with those of this model trained with the enriched dataset.

| Metric | Baseline (31 vars) | Model with new features (37 vars) |
| ---------- | ------------------------- | ---------------------------------- |
| Accuracy | 0.7863 | 0.7844 |
| Precision | 0.6273 | 0.6250 |
| ``Recall`` | ``0.4831`` | ``0.4724`` |
| F1 Score | 0.5458 | 0.5381 |
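The gaps in the table can be computed directly. This is a small sketch using only the numbers above, with differences expressed in percentage points (enriched model minus baseline):

```python
# Metrics copied from the comparison table above.
baseline = {"accuracy": 0.7863, "precision": 0.6273, "recall": 0.4831, "f1": 0.5458}
enriched = {"accuracy": 0.7844, "precision": 0.6250, "recall": 0.4724, "f1": 0.5381}

# Difference in percentage points (enriched minus baseline).
deltas = {
    name: round((enriched[name] - baseline[name]) * 100, 2)
    for name in baseline
}

for name, delta in deltas.items():
    print(f"{name:<9} {delta:+.2f} pp")
```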

**What happens to the model with the new variables?**

***Very similar results:***
Both versions of the model perform almost identically. Every metric differs by less than 1.5 percentage points, and the largest gap is in recall (0.4831 vs. 0.4724).

The enriched model is marginally lower on all four metrics, but differences this small on a single train/test split are not enough to say that either version clearly dominates the other.

``But most importantly``: the new features make business sense.
The model with the new variables isn't improved, but it isn't meaningfully worse either, and the new features add interpretability and actionability.

***Why is this good?***

It allows us to say:

"We can detect churn with the same performance as before, but now we know that customers with certain characteristics, such as being new, spending a lot, or having fiber optics, are more at risk."

These are levers the business can use (retargeting, discounts, etc.).

<br>

### We save the dataset with the new features

We conclude this notebook by saving the updated dataset for the next step, which will be to optimize these models with all the variables before selecting the best ones.

In [81]:
df.to_csv("../data/processed/telco-feature-engineering.csv", index=False)
print("\nSaved enriched dataset: telco-feature-engineering.csv")


Saved enriched dataset: telco-feature-engineering.csv


<br>

<hr>

## Author

<a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/">**Flavio Aguirre**</a>
<br>
<a href="https://coursera.org/share/e27ae5af81b56f99a2aa85289b7cdd04">***Data Scientist***</a>