<a name="Passo-1-estudo-de-churn---telecom-x"></a>
# Churn Study at Telecom X


The aim of this project is to analyze and predict customer churn for Telecom X, a fictitious company.

**Objective**:
1. Understand churn patterns.
2. Generate a model to predict churn (through machine learning).
3. Reflect on the data obtained to reduce churn.

📦 Source: https://raw.githubusercontent.com/alura-cursos/challenge2-data-science/refs/heads/main/TelecomX_Data.json

<a name="Load and Normalize "></a>














# Loading & Normalizing JSON Data


In [28]:
import pandas as pd
import requests

# path
url = "https://raw.githubusercontent.com/alura-cursos/challenge2-data-science/refs/heads/main/TelecomX_Data.json"

# reading json data
df_raw = pd.read_json(url)
df_raw.head()


Unnamed: 0,customerID,Churn,customer,phone,internet,account
0,0002-ORFBO,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'One year', 'PaperlessBilling': '..."
1,0003-MKNFE,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
2,0004-TLHLJ,Yes,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
3,0011-IGKFF,Yes,"{'gender': 'Male', 'SeniorCitizen': 1, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
4,0013-EXCHZ,Yes,"{'gender': 'Female', 'SeniorCitizen': 1, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."


<a name="data loading"></a>
# Data loading





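The Telecom X dataset is a JSON array of customer records, so `pandas.read_json` can read it directly from the URL. Each record has two top-level fields (`customerID`, `Churn`) plus four nested objects (`customer`, `phone`, `internet`, `account`) that still need to be flattened.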
Source: [TelecomX_Data.json](https://raw.githubusercontent.com/alura-cursos/challenge2-data-science/refs/heads/main/TelecomX_Data.json)



In [29]:
# JSON URL
url = "https://raw.githubusercontent.com/alura-cursos/challenge2-data-science/refs/heads/main/TelecomX_Data.json"

# Read the JSON (same call as in the previous cell)
df = pd.read_json(url)

# Show the first rows
df.head()


Unnamed: 0,customerID,Churn,customer,phone,internet,account
0,0002-ORFBO,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'One year', 'PaperlessBilling': '..."
1,0003-MKNFE,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
2,0004-TLHLJ,Yes,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
3,0011-IGKFF,Yes,"{'gender': 'Male', 'SeniorCitizen': 1, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
4,0013-EXCHZ,Yes,"{'gender': 'Female', 'SeniorCitizen': 1, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."


In [30]:

print(df_raw.columns)

Index(['customerID', 'Churn', 'customer', 'phone', 'internet', 'account'], dtype='object')


<a name="EDA"></a>
# EDA - Exploratory data analysis


In this step, we will visually and statistically explore the data to understand the behavior of Telecom X customers and identify patterns related to churn.


General information about the data:

In [31]:
# General Dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7267 entries, 0 to 7266
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   customerID  7267 non-null   object
 1   Churn       7267 non-null   object
 2   customer    7267 non-null   object
 3   phone       7267 non-null   object
 4   internet    7267 non-null   object
 5   account     7267 non-null   object
dtypes: object(6)
memory usage: 340.8+ KB


In [32]:
# Descriptive statistics
df.describe(include='all')

Unnamed: 0,customerID,Churn,customer,phone,internet,account
count,7267,7267,7267,7267,7267,7267
unique,7267,3,891,3,129,6931
top,9995-HOTOH,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'No', 'OnlineSecurity': 'N...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
freq,1,5174,223,3495,1581,6
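
The `unique` row above reports three distinct `Churn` values, so at least one value besides `Yes` and `No` is present. A minimal sketch of how this could be inspected, using a small hand-made Series rather than the real data (the blank string is only an assumption about what the third value might look like):

```python
import pandas as pd

# Illustrative values only; the blank entry stands in for a hypothetical third category
churn = pd.Series(["No", "Yes", "No", "", "No", "Yes"])

# Count every distinct value, including missing ones
counts = churn.value_counts(dropna=False)
print(counts)

# Keep only the rows whose label is usable for churn analysis
valid = churn[churn.isin(["Yes", "No"])]
print(f"{len(churn) - len(valid)} row(s) without a valid Churn label")
```

On the real data, the same two calls would be applied to `df['Churn']`.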


<a name="Nested data normalization"></a>
# Nested data normalization


Although the dataset looked flat at first glance, some columns like `customer`, `phone`, `internet`, and `account` contained JSON-formatted data (dictionaries). We used `pandas.Series` to expand these columns and transform each key into an individual column.


In [33]:
# Expanding nested columns
df_expanded = pd.concat([
    df_raw.drop(columns=['customer', 'phone', 'internet', 'account']),
    df_raw['customer'].apply(pd.Series),
    df_raw['phone'].apply(pd.Series),
    df_raw['internet'].apply(pd.Series),
    df_raw['account'].apply(pd.Series)
], axis=1)

# Rename ID column
df_expanded.rename(columns={"customerID": "id_client"}, inplace=True)

# Showing data normalized
df_expanded.head()


Unnamed: 0,id_client,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,No,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,"{'Monthly': 65.6, 'Total': '593.3'}"
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,No,No,No,No,No,Yes,Month-to-month,No,Mailed check,"{'Monthly': 59.9, 'Total': '542.4'}"
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,No,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,"{'Monthly': 73.9, 'Total': '280.85'}"
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,No,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,"{'Monthly': 98.0, 'Total': '1237.85'}"
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,No,No,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,"{'Monthly': 83.9, 'Total': '267.4'}"
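
As an alternative to calling `.apply(pd.Series)` on each column, `pd.json_normalize` can flatten nested dictionaries in one pass, including dictionaries nested inside other dictionaries such as `Charges`. The sketch below uses two hand-written records that imitate the layout shown above; the record values and the column-renaming step are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Two hand-written records that mimic the nested layout (illustrative data only)
records = [
    {
        "customerID": "A-1",
        "Churn": "No",
        "customer": {"gender": "Female", "tenure": 9},
        "account": {"Contract": "One year",
                    "Charges": {"Monthly": 65.6, "Total": "593.3"}},
    },
    {
        "customerID": "B-2",
        "Churn": "Yes",
        "customer": {"gender": "Male", "tenure": 4},
        "account": {"Contract": "Month-to-month",
                    "Charges": {"Monthly": 73.9, "Total": "280.85"}},
    },
]

# json_normalize flattens every nesting level and joins the keys with "."
flat = pd.json_normalize(records)

# Keep only the last part of each dotted name, e.g. "account.Charges.Monthly" -> "Monthly"
flat.columns = [name.split(".")[-1] for name in flat.columns]
print(flat.columns.tolist())
```

Dropping the prefixes is only safe when the inner keys are unique across the nested objects, as they are in this toy example.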


<a name="Charges column expand"></a>
# Expanding the Charges column


The `Charges` column stored two financial values: the monthly charge and the total amount charged to the customer. To make analysis easier, we separated this information into two distinct columns, `monthly_value` and `total_value`, and converted both to a numeric type, replacing values that could not be parsed with 0.

In [34]:
# Check which keys are nested inside 'account'
print(df_raw['account'].apply(pd.Series).columns)


Index(['Contract', 'PaperlessBilling', 'PaymentMethod', 'Charges'], dtype='object')


In [35]:
# JSON file URL
url = "https://raw.githubusercontent.com/alura-cursos/challenge2-data-science/refs/heads/main/TelecomX_Data.json"

# Read the JSON
df_raw = pd.read_json(url)

# Expand the four nested columns ('Charges' is still a dict at this point)
df_expanded = pd.concat([
    df_raw.drop(columns=['customer', 'phone', 'internet', 'account']),
    df_raw['customer'].apply(pd.Series),
    df_raw['phone'].apply(pd.Series),
    df_raw['internet'].apply(pd.Series),
    df_raw['account'].apply(pd.Series)
], axis=1)

# Expand the nested 'Charges' dict (which came from 'account') into its own columns
df_expanded = pd.concat([
    df_expanded.drop(columns=['Charges']),
    df_expanded['Charges'].apply(pd.Series)
], axis=1)

# Rename the monthly and total columns
df_expanded = df_expanded.rename(columns={
    'Monthly': 'monthly_value',
    'Total': 'total_value'
})

# Convert 'monthly_value' and 'total_value' to float; unparseable values become NaN and are then set to 0
df_expanded['monthly_value'] = pd.to_numeric(df_expanded['monthly_value'], errors='coerce').fillna(0)
df_expanded['total_value'] = pd.to_numeric(df_expanded['total_value'], errors='coerce').fillna(0)

# Rename the ID column
df_expanded = df_expanded.rename(columns={"customerID": "id_client"})

# Show the first 5 rows
df_expanded.head()


Unnamed: 0,id_client,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,monthly_value,total_value
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.9,280.85
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,98.0,1237.85
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,No,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,83.9,267.4
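
The `errors='coerce'` conversion used in the cell above can be seen in isolation in the sketch below. The three strings are made-up examples (the blank one is an assumption about what an unparseable total might look like); the point is that counting the `NaN` values before calling `fillna(0)` shows how many rows would otherwise be silently turned into zeros.

```python
import pandas as pd

# Made-up totals stored as text; the middle entry cannot be parsed as a number
totals = pd.Series(["593.3", " ", "280.85"])

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(totals, errors="coerce")
n_missing = int(numeric.isna().sum())
print(f"{n_missing} value(s) could not be converted")

# Same replacement as in the notebook cell: NaN -> 0
filled = numeric.fillna(0)
print(filled.tolist())
```

Whether 0 is a sensible replacement depends on the analysis; dropping those rows is another option.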


In [36]:
from google.colab import files

df_expanded.to_csv('/content/df_expanded.csv', index=False)

files.download('/content/df_expanded.csv')



<a name="Visual Exploratory Analysis of Telecom X Churn"></a>
# Visual Exploratory Analysis of Telecom X Churn


What we did here:

* We filtered only rows with valid Churn values ('Yes' and 'No')

* We created an interactive chart showing the number of customers who canceled and those who stayed

* We used different colors for quick visualization

In [37]:
import plotly.express as px

def plot_churn_distribution(df, min_threshold=0.15):
    """
    Generates a bar chart of the distribution of customers who canceled (Churn).

    Parameters:
    - df (pd.DataFrame): DataFrame with a 'Churn' column; only rows equal to 'Yes' or 'No' are plotted.
    - min_threshold (float): Share of all customers (between 0 and 1) at which to draw a reference line.

    Returns:
    - fig (plotly.graph_objects.Figure): The generated chart.
    """

    # Work on a copy that contains only valid 'Churn' values
    df_churn = df[df['Churn'].isin(['Yes', 'No'])].copy()

    # Churn counts and percentages
    churn_counts = df_churn['Churn'].value_counts()
    total = churn_counts.sum()
    percent_yes = (churn_counts['Yes'] / total) * 100
    percent_no = (churn_counts['No'] / total) * 100

    # Build the chart
    fig = px.histogram(
        df_churn,
        x='Churn',
        color='Churn',
        title='Churn distribution',
        labels={'Churn': 'Customer canceled?', 'count': 'Number of clients'},
        color_discrete_map={'Yes': 'crimson', 'No': 'mediumseagreen'},
        text_auto=True
    )

    # Add a reference line at the minimum churn threshold
    min_churn_count = total * min_threshold
    fig.add_hline(
        y=min_churn_count,
        line_dash="dash",
        line_color="blue",
        annotation_text=f"Minimum Churn Limit: {min_threshold * 100:.0f}%",
        annotation_position="top left",
        annotation_font_color="blue"
    )

    # Axis titles and general layout
    fig.update_layout(
        xaxis_title='Churn (Cancel)',
        yaxis_title='Number of clients',
        showlegend=False,
        font=dict(size=14)
    )

    # Percentage labels on the bars. add_annotation appends to the figure's
    # annotations, so the label of the reference line added above is kept.
    for value, percent, color in [
        ('Yes', percent_yes, 'crimson'),
        ('No', percent_no, 'mediumseagreen'),
    ]:
        fig.add_annotation(
            x=value,
            y=churn_counts[value],
            text=f"{percent:.1f}%",
            showarrow=False,
            font=dict(color=color, size=16)
        )

    return fig

# Build Graph about Churn distribution
fig_churn_distribution = plot_churn_distribution(df_expanded)

# Show Graph
fig_churn_distribution.show()
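
The percentages annotated on the chart can also be checked numerically. The sketch below reproduces the same filter-then-count logic on a tiny made-up frame (the values, including the blank label, are illustrative assumptions):

```python
import pandas as pd

# Made-up labels; the blank one imitates a row without a valid Churn value
toy = pd.DataFrame({"Churn": ["No", "No", "Yes", "No", "Yes", "", "No", "No"]})

# Same filter as in plot_churn_distribution
valid = toy[toy["Churn"].isin(["Yes", "No"])]

# Share of each class in percent, rounded to one decimal
shares = valid["Churn"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Replacing `toy` with `df_expanded` would give the shares for the real data.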





<a name="churn-distribution by contract type"></a>
# Churn distribution by contract type


This chart is important because it helps us understand which contract types have the most customers canceling. It could indicate, for example, that customers with monthly contracts cancel more than those with annual contracts.


In [38]:
fig_contract = px.histogram(
    df_expanded,
    x='Contract',
    color='Churn',
    barmode='group',
    title='Churn by contract type',
    labels={
        'Contract': 'Contract type',
        'count': 'Number of clients',
        'Churn': 'Cancel (Churn)'
    },
    color_discrete_map={'Yes': 'crimson', 'No': 'mediumseagreen'}
)

fig_contract.update_layout(
    xaxis_title='Contract Type',
    yaxis_title='Number of clients',
    font=dict(size=14)
)

fig_contract.show()


<a name="churn by payment method"></a>
# Churn by payment method


Let's create a grouped histogram to visualize the distribution of customers who canceled or not by payment type.

In [39]:
# Chart: Churn by payment method
fig_payment = px.histogram(
    df_expanded,
    x='PaymentMethod',
    color='Churn',
    barmode='group',
    title='Churn by payment method',
    labels={
        'PaymentMethod': 'Payment Method',
        'count': 'Number of clients',
        'Churn': 'Cancel (Churn)'
    },
    color_discrete_map={'Yes': 'crimson', 'No': 'mediumseagreen'}
)

fig_payment.update_layout(
    xaxis_title='Payment Method',
    yaxis_title='Number of clients',
    font=dict(size=14),
    xaxis_tickangle=-25
)

fig_payment.show()



What to note in this graph:

1. Do methods like "Electronic Check" tend to have higher churn?

2. Is there a preference for more traditional methods (mailed check)?

3. Does any category stand out for retaining more loyal customers?
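
Grouped counts make these questions hard to answer by eye when the categories differ in size. One possible way to get comparable numbers is a row-normalized crosstab, sketched here on made-up data (the rows below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up customers: 4 pay by electronic check, 3 by mailed check, 3 by credit card
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check"] * 4
                     + ["Mailed check"] * 3
                     + ["Credit card (automatic)"] * 3,
    "Churn": ["Yes", "Yes", "No", "Yes",
              "No", "No", "Yes",
              "No", "No", "No"],
})

# normalize="index" turns each row into shares that sum to 1 within a payment method
rates = pd.crosstab(toy["PaymentMethod"], toy["Churn"], normalize="index").mul(100).round(1)

# Highest churn rate first
rates_sorted = rates.sort_values("Yes", ascending=False)
print(rates_sorted)
```

The same call works for any categorical column, for example `InternetService` or `Contract`.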

<a name="churn-by-internet-type"></a>
#  Churn by internet type


Internet service can significantly impact customer experience. Let's explore how connection type (DSL, Fiber Optic, None) relates to churn.

In [40]:
# chart: Churn by internet type
fig_internet = px.histogram(
    df_expanded,
    x='InternetService',
    color='Churn',
    barmode='group',
    title='Churn by internet type',
    labels={
        'InternetService': 'type of internet',
        'count': 'Number of Clients',
        'Churn': 'Cancel (Churn)'
    },
    color_discrete_map={'Yes': 'crimson', 'No': 'mediumseagreen'}
)

fig_internet.update_layout(
    xaxis_title='Type of internet',
    yaxis_title='Number of clients',
    font=dict(size=14)
)

fig_internet.show()


What to analyze:

- Are fiber optic customers canceling more?

- Are those without internet less likely to churn?

- Is DSL service more stable?

<a name="churn-telephone-service"></a>
# Churn by phone service


This chart shows the relationship between having phone service and customer churn.

In [41]:
# Churn by phone service
fig_phone = px.histogram(
    df_expanded,
    x='PhoneService',
    color='Churn',
    barmode='group',
    text_auto=True,
    title='Churn by phone service',
    labels={'PhoneService': 'Has phone service', 'Churn': 'Cancel'}
)

fig_phone.update_layout(
    xaxis_title='Phone service',
    yaxis_title='Number of clients',
    legend_title='Cancel',
    bargap=0.3,
    template='plotly_white',
    title_font_size=22
)

fig_phone.show()


## Predicted Insight
Most customers have phone service.

Comparing the churn rate between those who do and don't have phone service helps understand whether there's a relationship between this service and cancellation.

<a name="churn-paymentomode"></a>
# Churn by payment method


Relationship between churn and payment method.


In [42]:
# Churn by payment method
fig_payment = px.histogram(
    df_expanded,
    x='PaymentMethod',
    color='Churn',
    barmode='group',
    text_auto=True,
    title='Churn by Payment Method',
    labels={'PaymentMethod': 'Payment Method', 'Churn': 'Cancel'}
)

fig_payment.update_layout(
    xaxis_title='Payment method',
    yaxis_title='Number of clients',
    legend_title='Cancel',
    bargap=0.3,
    template='plotly_white',
    title_font_size=22
)

fig_payment.show()


## Predicted Insight

Customers who use electronic checks tend to have a higher churn rate.

This may indicate dissatisfaction or vulnerability among less loyal customers (e.g., electronic payments without a loyalty agreement).

<a name="internet-type"></a>
# Churn by internet type

In [43]:
# Churn by Internet type
fig_internet = px.histogram(
    df_expanded,
    x='InternetService',
    color='Churn',
    barmode='group',
    text_auto=True,
    title='Churn by internet type',
    labels={'InternetService': 'Type of internet', 'Churn': 'Cancel'}
)

fig_internet.update_layout(
    xaxis_title='Type of internet',
    yaxis_title='Number of Clients',
    legend_title='Cancel',
    bargap=0.3,
    template='plotly_white',
    title_font_size=22
)

fig_internet.show()


## Predicted Insight

* Fiber optic customers tend to have higher churn rates than DSL customers and customers without internet service.

* This may indicate dissatisfaction with the performance or cost of fiber optic service.

<a name="chart"></a>
# Churn by monthly value distribution


We want to see whether customers who canceled paid different amounts from those who stayed.

In [44]:
# Distribution of monthly values, for rows with a valid Churn label
df_filtered = df_expanded[df_expanded['Churn'].isin(['Yes', 'No'])]

fig_monthly_value = px.box(
    df_filtered,
    x='Churn',
    y='monthly_value',
    color='Churn',
    title='Churn by monthly values',
    labels={'Churn': 'Cancel', 'monthly_value': 'Monthly value (R$)'}
)

fig_monthly_value.update_layout(
    xaxis_title='Cancel',
    yaxis_title='Monthly value (R$)',
    showlegend=False,
    template='plotly_white',
    title_font_size=22
)

fig_monthly_value.show()

## What we need to check
* Do customers who cancel (Churn = Yes) pay, on average, higher or lower amounts?

* If yes, this may indicate price sensitivity or dissatisfaction with the cost-benefit ratio.
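
A box plot shows the distributions, and a small summary table can put numbers next to it. The sketch below compares mean and median monthly value per churn group on made-up amounts (illustrative values only):

```python
import pandas as pd

# Made-up monthly values for three customers who left and three who stayed
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "monthly_value": [70.0, 80.0, 90.0, 20.0, 60.0, 70.0],
})

# Mean, median and group size per churn status
summary = toy.groupby("Churn")["monthly_value"].agg(["mean", "median", "count"])
print(summary)
```

On the real data, the same `groupby` would be applied to `df_filtered`, so rows without a valid `Churn` label are excluded.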

<a name=" Stay time churn-"></a>
# Analysis of Churn and stay time


In this **boxplot** graph with Plotly Express, we compare the **tenure** of customers who:

- ❌ **cancelled the service (Churn = "Yes")**
- ✅ **continued (Churn = "No")**

---

### What to note:

- The median (center line of the box) shows the typical length of time customers stay with the company.
-  Customers who canceled generally have shorter tenure, indicating that most cancellations occur among customers with shorter tenure.
- The dispersion of the data shows variability in tenure for both groups.
-  This suggests that the company can focus on customer retention efforts in the first few months of their contract to reduce churn.

---

In [45]:
# Keep only rows with a valid Churn value
df_plot = df_expanded[df_expanded['Churn'].isin(['Yes', 'No'])]

fig_tenure = px.box(
    df_plot,
    x='Churn',
    y='tenure',
    color='Churn',
    title='Customer Stay Time by Cancellation Status',
    labels={
        'Churn': 'Cancel',
        'tenure': 'Stay time (months)'
    }
)

fig_tenure.update_layout(
    template='plotly_white',
    title_font_size=22,
    showlegend=False
)

fig_tenure.show()



# Churn rate by contract type

Which contract types have the most influence on churn?

In [46]:
# First, count customers by contract type and churn status
churn_contract_type = (
    df_expanded.groupby(['Contract', 'Churn'])
    .size()
    .reset_index(name='Amount')
)

# Then divide each count by the total for its contract type to get a percentage
total_by_contract = churn_contract_type.groupby('Contract')['Amount'].transform('sum')
churn_contract_type['Percentual'] = churn_contract_type['Amount'] / total_by_contract * 100


# Stacked bar chart with percentage labels
fig_churn_contract = px.bar(
    churn_contract_type,
    x='Contract',
    y='Percentual',
    color='Churn',
    color_discrete_map={'No': 'green', 'Yes': 'red'},
    labels={
        'Contract': 'Contract type',
        'Percentual': 'Percentage (%)',
        'Churn': 'Churn (Cancel)'
    },
    title='📉 Churn rate by contract type - Telecom X',
    text=churn_contract_type['Percentual'].apply(lambda x: f'{x:.1f}%')
)

fig_churn_contract.update_layout(barmode='stack', yaxis=dict(ticksuffix='%'))
fig_churn_contract.show()

## Graph Analysis

This graph shows the churn rate (service cancellation) for each Telecom X contract type:

- Monthly (Month-to-month)
- One-year (One year)
- Two-year (Two year)

The bars show the percentage of customers who did not cancel (in green) and who did cancel (in red).

Important Insights:

- Customers with monthly contracts have the highest churn rate, indicating a greater propensity to cancel.
- One- and two-year contracts have lower churn rates, suggesting greater loyalty.
- Converting monthly customers to longer contracts is a recommended strategy for reducing churn.

<a name="churn--rate-by-contract-type"></a>
# Churn rate by contract time (tenure)


In [47]:
# Create a numeric churn column: 1 for 'Yes', 0 for 'No' (any other value becomes NaN)
df_expanded['Churn_num'] = df_expanded['Churn'].map({'Yes':1, 'No':0})

# Group by 'tenure' and compute the churn rate (the mean of the 0/1 column)
churn_by_tenure = df_expanded.groupby('tenure')['Churn_num'].mean().reset_index()

# Line chart
fig_tenure = px.line(
    churn_by_tenure,
    x='tenure',
    y='Churn_num',
    labels={
        'tenure': 'Contract time (months)',
        'Churn_num': 'Churn rate'
    },
    title='Churn rate by contract time',
    markers=True
)

fig_tenure.update_layout(yaxis_tickformat=".0%")  # Show % on the y axis
fig_tenure.show()


<a name="Report"></a>

# Report: Churn rate by contract time in Telecom X




This graph shows the customer churn rate as a function of contract length (in months).

We observed that churn is higher in the first few months of the contract, indicating that many customers cancel their relationship with Telecom X early. As the contract length increases, the churn rate decreases, suggesting greater loyalty among customers who remain beyond the initial period.

### Insight
Strategies focused on reducing churn in the first few months can significantly increase retention. Welcome programs, initial discounts, and personalized service are examples of recommended actions.
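
To make "the first few months" concrete, tenure can be grouped into bands and the churn rate computed per band. The sketch below does this with `pd.cut` on made-up data; the band edges and the toy values are assumptions chosen for illustration, not thresholds taken from the analysis above.

```python
import pandas as pd

# Made-up tenures (in months) and 0/1 churn flags
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 5, 8, 12, 20, 30, 48, 70],
    "Churn_num": [1, 1, 0, 1, 1, 0, 0, 1, 0, 0],
})

# Hypothetical band edges; include_lowest=True keeps a tenure of 0 in the first band
bins = [0, 6, 12, 24, 72]
labels = ["0-6", "7-12", "13-24", "25-72"]
toy["tenure_band"] = pd.cut(toy["tenure"], bins=bins, labels=labels, include_lowest=True)

# Churn rate (mean of the 0/1 flag) and number of customers per band
churn_by_band = toy.groupby("tenure_band", observed=True)["Churn_num"].agg(["mean", "size"])
print(churn_by_band)
```

Reporting the group size alongside the rate helps avoid over-reading bands that contain only a handful of customers.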

In [48]:
!pip install -U plotly



In [49]:
# Cell 1: Install Kaleido
!pip install -U kaleido



In [50]:
# Kaleido needs a Chrome browser to render images; download one with its built-in helper
import kaleido
kaleido.get_chrome_sync()
print(" Google Chrome installed successfully by Kaleido.")

 Google Chrome installed successfully by Kaleido.


In [51]:
# Cell 2: Test if Kaleido is working
import plotly.io as pio

# Build a  bar chart
fig = px.bar(x=["A", "B", "C"], y=[1, 3, 2])

# Save image
pio.write_image(fig, "/content/test_kaleido.png")
print("Kaleido is working, saved image in /content/test_kaleido.png")

ModuleNotFoundError: No module named 'plotly.validators.layout.margin'
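
An error like this one is often a sign that a package was upgraded with `pip` while an older version was already imported in the running session; restarting the runtime after the upgrade usually clears it. The helper below is a hypothetical diagnostic (not part of the original notebook) that compares the version loaded in memory with the version installed on disk. It assumes the import name and the distribution name are the same, which holds for `plotly`.

```python
import importlib
import importlib.metadata


def loaded_vs_installed(package_name):
    """Return (version imported in this session, version installed on disk)."""
    try:
        module = importlib.import_module(package_name)
        loaded = getattr(module, "__version__", None)
    except ImportError:
        loaded = None  # package cannot be imported at all

    try:
        installed = importlib.metadata.version(package_name)
    except importlib.metadata.PackageNotFoundError:
        installed = None  # no installed distribution with this name

    return loaded, installed


loaded, installed = loaded_vs_installed("plotly")
if loaded != installed:
    print(f"Loaded {loaded}, installed {installed}: restart the runtime and re-run the imports.")
else:
    print(f"plotly version in use: {loaded}")
```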

In [None]:
# Export the charts as PNG images to /content/

# Names of the figure variables to export
figures_to_save = [
    'fig_churn_distribution', 'fig_contract', 'fig_payment', 'fig_internet',
    'fig_phone', 'fig_monthly_value', 'fig_tenure', 'fig_churn_contract', 'fig'
]

for name_fig in figures_to_save:
    # Look the figure up by name; skip names that were never defined
    fig = globals().get(name_fig)
    if fig is not None:
        path = f"/content/{name_fig}.png"
        fig.write_image(path)
        print(f" {name_fig} saved to {path}")
    else:
        print(f" Figure {name_fig} not found.")


<a name="churn-analysis-in-telecom-x"></a>
# Churn Analysis - Telecom X


### Overview
- Churn: Share of customers who canceled (Churn = Yes) versus those who remained active (Churn = No).

### Detailed Analyses
- Contract Type: Month-to-month plans exhibit the highest likelihood of cancellation.
- Payment Method: Customers paying by electronic check show a higher churn rate.
- Internet Services: Subscribers on fiber‑optic internet experience more churn compared with other connection types.
- Temporal Analysis: Churn tends to peak during the initial months of tenure.

### Recommendations
- Customer Retention: Prioritize retaining month‑to‑month customers by offering loyalty‑oriented plans and incentives.
- Customer Experience: Enhance the experience for fiber customers by identifying and addressing key drivers of dissatisfaction.
- Payment Policies: Reassess digital payment policies and encourage methods associated with stronger loyalty.
- Proactive Support: Invest in proactive onboarding and support during the early months of the contract, when churn risk is most acute.


## Full analysis report

In [None]:
from IPython.display import display, Markdown

# Simulated dataset for illustration only (note: this overwrites `df` and the fig_* variables)
data = {
    'Churn': ['YES', 'NO', 'YES', 'NO', 'NO', 'YES', 'NO', 'YES', 'NO', 'NO'],
    'contract': ['Monthly', 'Annual', 'Monthly', 'Monthly', 'Annual', 'Monthly', 'Annual', 'Monthly', 'Annual', 'Monthly'],
    'payment_method': ['electronic', 'ticket', 'electronic', 'card', 'card', 'electronic', 'ticket', 'card', 'ticket', 'electronic'],
    'Internet_tecnology': ['Fiber', 'ADSL', 'Fiber', 'Fiber', 'ADSL', 'Fiber', 'Fiber', 'ADSL', 'ADSL', 'Fiber'],
    'contract_time': [1, 12, 3, 2, 24, 1, 18, 2, 36, 1]
}

df = pd.DataFrame(data)

# Helper that displays a title, a short text, and a chart
def show_section(title, text, fig):
    display(Markdown(f"### {title}"))
    display(Markdown(text))
    fig.show()

# Chart 1: Churn proportion
fig_churn = px.pie(df, names='Churn', title='Proportion of customers who canceled (Churn)')
show_section(
    "1. Churn overview",
    "This chart shows the proportion of customers who canceled (Churn = YES) versus those who stayed (Churn = NO).",
    fig_churn
)

# Chart 2: Churn by contract type
contract_churn = df.groupby(['contract', 'Churn']).size().reset_index(name='count')
fig_contract = px.bar(contract_churn, x='contract', y='count', color='Churn', barmode='group',
                     title='Churn by contract type')
show_section(
    "2. Contract type analysis",
    "Shows the relationship between contract type and churn. Customers on monthly contracts churn more.",
    fig_contract
)

# Chart 3: Churn by payment method
pagamento_churn = df.groupby(['payment_method', 'Churn']).size().reset_index(name='count')
fig_payment = px.bar(pagamento_churn, x='payment_method', y='count', color='Churn', barmode='group',
                      title='Churn by payment method')
show_section(
    "3. Payment method analysis",
    "Customers who pay electronically show a higher churn rate than those using other payment methods.",
    fig_payment
)

# Chart 4: Churn by internet technology
internet_churn = df.groupby(['Internet_tecnology', 'Churn']).size().reset_index(name='count')
fig_internet = px.bar(internet_churn, x='Internet_tecnology', y='count', color='Churn', barmode='group',
                    title='Churn by internet technology')
show_section(
    "4. Internet services and churn",
    "The chart shows that fiber-optic customers churn more than customers on other technologies.",
    fig_internet
)

# Chart 5: Churn by contract time (months)
churn_time = df.groupby(['contract_time', 'Churn']).size().reset_index(name='count')
fig_time = px.bar(churn_time, x='contract_time', y='count', color='Churn', barmode='group',
                   title='Churn by contract time')
show_section(
    "5. Churn analysis by contract time",
    "This chart shows churn against contract time. Churn is higher in the first months.",
    fig_time
)

# Final recommendations
display(Markdown("""
---
## 🔍Recommendations for Telecom X

- **Focus on customer retention with monthly contracts**, offering plans with loyalty benefits.
- **Improve the customer experience with fiber optic services**, identifying and resolving causes of dissatisfaction.
- **Review electronic payment policies**, seeking to encourage methods that generate greater loyalty.
- **Invest in proactive support and service in the first months of the contract**, when churn is most critical.

---


"""))


## Custom Class for PDF Formatting

### The code below defines a custom class
This code subclasses `FPDF` to give the report a consistent layout.

**Features**:
- A centered report header, drawn automatically on each page.
- Helper methods for chapter titles and body text.

### Report Display
Once generated, the report can be downloaded from the link in the Report Download section.

---



In [None]:
!pip install fpdf

In [None]:
from fpdf import FPDF

# Custom class that defines the PDF layout
class PDF(FPDF):
    def header(self):
        self.set_font("Arial", "B", 14)
        self.cell(0, 10, "Report Churn Analysis- Telecom X", ln=True, align="C")
        self.ln(10)

    def chapter_title(self, title):
        self.set_font("Arial", "B", 12)
        self.set_text_color(0)
        self.cell(0, 10, title, ln=True)
        self.ln(4)

    def chapter_body(self, body):
        self.set_font("Arial", "", 11)
        self.multi_cell(0, 8, body)
        self.ln()

# make PDF
pdf = PDF()
pdf.add_page()

# final recommendations
recommendations = """
Recommendations for Telecom X:

- Focus on retaining customers with monthly contracts, offering plans with loyalty benefits.
- Improve the customer experience with fiber optic services, identifying and resolving causes of dissatisfaction.
- Review electronic payment policies, seeking to encourage methods that generate greater loyalty.
- Invest in proactive support and service in the first months of the contract, when churn is most critical.

This report provides a clear and visual overview for strategic decision-making aimed at reducing churn and increasing customer loyalty.

Report automatically generated by data analysis with Python and Plotly in Colab.

"""

# Add the content
pdf.chapter_title("Final Recommendations")
pdf.chapter_body(recommendations)

# Export the PDF
pdf_path = "/content/report_churn_telecomx.pdf"
pdf.output(pdf_path)

print(f"✅ PDF created successfully: {pdf_path}")


<a name="download-report"></a>
#  Report Download


Click the link below to download Telecom X's final churn report in PDF format, containing visual analysis and strategic recommendations generated with Python.

In [None]:
from IPython.display import FileLink

# 🔗 Link to download
FileLink('/content/report_churn_telecomx.pdf')


<a name="machine-learning"></a>
## Preprocessing for Machine Learning


Steps:

* Separate predictor (X) and target (y) variables

* Encode categorical variables with OneHotEncoder

* Split into training and testing

* Train the Random Forest model

* Evaluate performance with metrics

* Display variable importance

## Preprocessing for Machine Learning with Random Forest

In this step, we'll prepare the data to apply a Random Forest machine learning model.

The goal is to predict customer churn (cancellation) based on the features present in our `df_expanded` dataset.

### What we'll do:
- Create a numeric column for the target variable (`Churn`).
- Handle missing values.
- Encode categorical variables.
- Separate the data into training and testing sets.
- Train the Random Forest model.
- Evaluate model performance.

## Next Step: Preprocessing and Training the Random Forest Model

In this step, we prepare the data for the Machine Learning model and train a Random Forest classifier to predict customer churn.

### Steps Performed:

1. **Creating the numeric target variable (`Churn_num`):**
We convert the categorical variable `Churn` to binary values, where 'No' becomes 0 and 'Yes' becomes 1, so the algorithm can work with a numeric target.

2. **Missing Data Treatment:**
We remove records with null values in critical columns, such as `total_value` and `Churn_num`, to ensure the model receives clean data.

3. **Separating Explanatory Variables (Features) from the Target Variable:**
We remove columns irrelevant to the model, such as `client_id` and the original `Churn` column.

4. **Encoding of categorical variables:**
We use `OneHotEncoder` to transform textual variables into numeric indicator columns, allowing the model to process this information.

5. **Dividing the data into training and testing:**
We reserve 30% of the data for testing (`test_size=0.3`), ensuring model validation on previously unseen data.

6. **Random Forest Training:**
We apply the Random Forest algorithm, a robust and efficient ensemble for classification, to learn patterns that indicate churn.

7. **Model Evaluation:**
We calculate performance metrics such as accuracy, confusion matrix, and detailed reporting to understand the effectiveness of the prediction.

---

This process paves the way for predictive analytics and a better understanding of the behavior of customers who cancel services, helping the company make strategic retention decisions.
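
As a rough illustration of steps 1, 2, and 5, the sketch below runs them on a tiny invented DataFrame. The `toy` data and its values are assumptions made for this example only, not the Telecom X data. It shows that labels outside the mapping become `NaN` and that a stratified split keeps the churn rate the same in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 non-churners, 10 churners, 1 empty label
toy = pd.DataFrame({
    "tenure": list(range(31)),
    "Churn": ["No"] * 20 + ["Yes"] * 10 + [""],
})

# Step 1: map text labels to 0/1; anything outside the mapping becomes NaN
toy["Churn_num"] = toy["Churn"].map({"No": 0, "Yes": 1})

# Step 2: drop rows whose target could not be mapped, then cast to int
toy = toy.dropna(subset=["Churn_num"]).astype({"Churn_num": int})

# Step 5: a stratified split keeps the churn rate similar in both sets
X_toy = toy[["tenure"]]
y_toy = toy["Churn_num"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(toy), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With 30 valid rows and one third churners, both the training and the test set end up with a churn rate of one third.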

In [None]:
# Check the total number of NaNs in the DataFrame
total_nans = df_expanded.isna().sum().sum()
print(f"Total NaN values in the DataFrame: {total_nans}")

# Show NaNs by column
print("\nNaN count by column:")
print(df_expanded.isna().sum())


<a name="model class"></a>
## Classification Model with Random Forest and Class Balancing with SMOTE

In this notebook, we will build a predictive model to identify customers who will cancel a service (churn) using Random Forest. The process includes the following main steps:

1. **Data Preparation**
- Separation of the independent variables (`features`) and the target variable (`target`), which indicates whether the customer churned or not.
- Identification of the numerical and categorical columns to apply the appropriate preprocessing.

2. **Preprocessing**
- Application of One-Hot Encoding to transform the categorical variables into a numerical format suitable for the model.
- Maintenance of the original numerical variables.

3. **Training/Test Split**
- We split the data into training and test sets, preserving the proportion of the target variable (`stratify=y`) to ensure representativeness.

4. **Balancing with SMOTE**
- The training set is balanced using SMOTE, a technique that generates synthetic samples for the minority class, reducing the model's bias toward the majority class.

5. **Model Training and Optimization**
- We train a Random Forest classifier.
- We use GridSearchCV to find the best model hyperparameters, evaluating them through 5-fold cross-validation and optimizing the F1-score metric.

6. **Final Evaluation**
- The optimized model is evaluated on the test set.
- The accuracy, precision, recall, and F1-score metrics are calculated, as well as the confusion matrix for detailed performance analysis.

This approach handles imbalanced data and combines careful preprocessing with automatic hyperparameter tuning, aiming for strong performance in the classification task.
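
To make the SMOTE idea in step 4 more concrete, here is a simplified sketch of how synthetic minority samples can be generated by interpolation. This is an illustration only and not the `imblearn` implementation used in the notebook. The function name `smote_like_samples` and the toy points are assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling (illustration only).

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority neighbors.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)

    # Pairwise Euclidean distances among minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # random base samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one neighbor each
    gap = rng.random((n_new, 1))                               # factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class points (k must be smaller than their count)
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
synthetic = smote_like_samples(X_minority, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic point is a blend of two existing minority samples, it always stays inside the region those samples already cover.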

---

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import joblib

# Load data from the CSV file
df = pd.read_csv('/content/df_expanded.csv')

# Name of the target column
target = 'Churn'

# Inspect the unique values of the target column
print("Unique values in Churn:", df[target].unique())

# Map text values to 0 and 1 (unmapped values become NaN)
map_churn = {'No': 0, 'Yes': 1}
df[target] = df[target].map(map_churn)

# Remove rows with a null target
df = df.dropna(subset=[target])

# Ensure the target is of integer type
y = df[target].astype(int)

# Define the features
features = df.drop(columns=[target]).columns.tolist()
X = df[features]

# Numerical and categorical variables
num_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_features = X.select_dtypes(include=['object', 'bool']).columns.tolist()

# Preprocessing: one-hot encode categorical columns, pass the rest through
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)
    ],
    remainder='passthrough'
)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Apply preprocessing (fit on the training data only)
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)

# Apply SMOTE to balance the classes in the training set
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_prep, y_train)

# Configure the Random Forest
rf = RandomForestClassifier(random_state=42)

# GridSearch parameters
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}

# GridSearch with 5 folds and the F1 metric
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)

# Train the model on the balanced data
grid_search.fit(X_train_smote, y_train_smote)

print("Best parameters found:", grid_search.best_params_)

# Evaluate on the test set
y_pred = grid_search.predict(X_test_prep)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# --------- Save model and preprocessor ---------

# Best trained model
model_rf = grid_search.best_estimator_

# Save model and preprocessor
joblib.dump(model_rf, 'model_rf.joblib')
joblib.dump(preprocessor, 'preprocessor.joblib')

print("Files model_rf.joblib and preprocessor.joblib saved successfully!")

# --- Download (Colab only) ---
try:
    from google.colab import files
    files.download('model_rf.joblib')
    files.download('preprocessor.joblib')
except ImportError:
    print("Download skipped: google.colab is not available in this environment.")


<a name="-"></a>
## Feature Importance Analysis of the Random Forest Model


In this step, we evaluate which variables had the greatest impact on the predictive ability of the best Random Forest model obtained by `GridSearchCV`.

---

##  Extracting Feature Importances

The model's `feature_importances_` attribute returns a numerical score for each variable, indicating its relevance in tree construction and, consequently, in the model's decision-making.

---

##  Retrieving Feature Names After Preprocessing

Since the dataset has undergone transformations—primarily the `OneHotEncoder` applied to categorical variables—we need to:

- Obtain the category names expanded by the encoder;
- Match these names with the numeric features that were kept unchanged.

This ensures a correct correspondence between names and importance scores.

---

##  Organizing Data for Visualization

We created a DataFrame sorted in descending order of importance, facilitating the identification of the most relevant variables for the model.

---

##  Graphical Visualization

We present a horizontal bar chart with the **10 most important features**, which visually highlights the main factors that influence churn prediction.

---

### Importance of Analysis

Understanding the most relevant features is essential for:

- Interpreting the model's behavior;
- Identifying key variables for strategic decisions and interventions;
- Guiding actions focused on reducing churn based on concrete data.

---

> **Note:** This analysis complements the quantitative evaluation of the model, allowing for a qualitative view that adds value to the interpretation of the results.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Best trained model
modelo_rf = grid_search.best_estimator_

# 1. Feature importances
importances = modelo_rf.feature_importances_

# 2. Get the column names after preprocessing (OneHotEncoder + passthrough)
# This takes a few steps because preprocessor is a ColumnTransformer

# Extract the feature names from the OneHotEncoder
ohe = preprocessor.named_transformers_['cat']
names_cat = ohe.get_feature_names_out(cat_features)

# Combine the transformed column names (encoded first, then passthrough)
feature_names = np.concatenate([names_cat, num_features])

# 3. Build a DataFrame for visualization
df_importances = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values(by='importance', ascending=False)

print(df_importances)

# 4. Plot the 10 most important features
plt.figure(figsize=(10,6))
plt.barh(df_importances['feature'].head(10)[::-1], df_importances['importance'].head(10)[::-1])
plt.xlabel('Importance')
plt.title('Top 10 Most Important Features in the Random Forest')
plt.show()


<a name=""></a>
## Training and Evaluating the Random Forest Model with the 10 Most Important Features


In this block, we perform a new round of training and evaluation using **only the 10 most relevant variables** identified in the previous feature importance analysis step. The goal is to verify whether the model maintains good performance with a reduced feature set, which can bring benefits such as:

- Reduced computational complexity;
- Improved interpretability;
- Potential improvement in model generalization.

---
## Detailed Steps:

### 1. Extraction of the Most Important Features

- We use the `feature_importances_` attribute of the best previously trained Random Forest model to obtain the ranking of the variables.
- We retrieve the names of the features already transformed by the preprocessing pipeline (`OneHotEncoder` + other transformations).

### 2. Selection of the Top 10 Features

- We build a DataFrame sorted by importance and extract the top 10 variables.
- We map the indices of these variables for selection in the preprocessed training and testing matrices.

### 3. Preprocessing with Selected Features

- We apply preprocessing (`transform`) to the original data to obtain the complete numerical matrices.
- We select only the columns corresponding to the 10 most important features.

### 4. Data Balancing with SMOTE

- We apply SMOTE only to the training set with the reduced features to correct the target class imbalance.

### 5. Training the Reduced Model

- We train a new Random Forest model with the best parameters found in `GridSearchCV`, but using only the selected features.

### 6. Model Evaluation

- We evaluate the accuracy, in addition to the classification report and confusion matrix, to measure the performance of the reduced version of the model on the testing set.

---

## Importance of this approach

Reducing the number of features without losing performance is a best practice for:

- Making models more efficient and faster;
- Facilitating the interpretation and communication of results to stakeholders;
- Minimizing potential noise caused by irrelevant variables.

---

> **Tip:** Always validate model performance with reduced features to ensure that simplification does not compromise prediction quality.
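
One way to follow this tip is to score the full and reduced models on the same test split. The sketch below is a minimal illustration on synthetic data from `make_classification`. It is not the Telecom X dataset, and the sample sizes and parameters are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Telecom X dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit on all features and rank them by importance
full = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
top_idx = np.argsort(full.feature_importances_)[::-1][:10]

# Refit on the top-10 columns only
reduced = RandomForestClassifier(n_estimators=100, random_state=42)
reduced.fit(X_tr[:, top_idx], y_tr)

# Compare both models on the same held-out data
f1_full = f1_score(y_te, full.predict(X_te))
f1_reduced = f1_score(y_te, reduced.predict(X_te[:, top_idx]))
print(f"F1 full: {f1_full:.3f} | F1 top-10: {f1_reduced:.3f}")
```

If the two scores are close, the reduced model is a reasonable simplification. A clear drop suggests that some discarded features still carried signal.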

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# --- Assuming you already have these objects after your initial training ---
# grid_search: GridSearchCV with RandomForest already trained
# preprocessor: ColumnTransformer with OneHotEncoder already fitted
# X_train, X_test, y_train, y_test: already defined

# 1. Extract importances from the best model
rf_model = grid_search.best_estimator_
importances = rf_model.feature_importances_

# 2. Get the feature names after the preprocessor (OneHotEncoder)
features_names = preprocessor.get_feature_names_out()

# 3. Build a DataFrame to show the importances
df_importance = pd.DataFrame({
    'feature': features_names,
    'importance': importances
}).sort_values(by='importance', ascending=False)

print("Top 10 most important features:")
print(df_importance.head(10))

top_features = df_importance['feature'].head(10).tolist()

# 4. Map the column indices that correspond to the top features
index_top_features = [i for i, f in enumerate(features_names) if f in top_features]

# 5. Select the top-feature columns from the preprocessed train and test sets
X_train_prep = preprocessor.transform(X_train)  # X_train and X_test must be the original (untransformed) splits
X_test_prep = preprocessor.transform(X_test)

X_train_top = X_train_prep[:, index_top_features]
X_test_top = X_test_prep[:, index_top_features]

# 6. Apply SMOTE only to the training set with the selected features
smote = SMOTE(random_state=42)
X_train_smote_top, y_train_smote_top = smote.fit_resample(X_train_top, y_train)

# 7. Train a Random Forest with the best parameters on the reduced set
rf_top = RandomForestClassifier(
    n_estimators=grid_search.best_params_['n_estimators'],
    max_depth=grid_search.best_params_['max_depth'],
    min_samples_split=grid_search.best_params_['min_samples_split'],
    random_state=42
)

rf_top.fit(X_train_smote_top, y_train_smote_top)

# 8. Evaluate
y_pred_top = rf_top.predict(X_test_top)

print("Accuracy with top 10 features:", accuracy_score(y_test, y_pred_top))
print("\nClassification report with top 10 features:\n", classification_report(y_test, y_pred_top))
print("Confusion matrix with top 10 features:\n", confusion_matrix(y_test, y_pred_top))


<a name=""></a>
## Model Evaluation with Threshold Adjustment and ROC / Precision-Recall Curves


In this code snippet, we delve deeper into the evaluation of the classification model balanced by SMOTE and optimized via Random Forest, focusing on:

- **Analysis of the predicted probabilities** for the positive class;
- **Visualization of the ROC and Precision-Recall curves**, essential metrics for evaluating binary classifiers, especially on imbalanced datasets;
- **Adjustment of the decision threshold** to find the optimal point that balances recall and precision, going beyond the default value (0.5).

---

## Steps performed:

### 1. Calculation of predicted probabilities

- We extract the predicted probabilities for the positive class (`y_prob`) using the `predict_proba` method from the best model found (`grid_search`).

### 2. Function for model evaluation at different thresholds

- We created a function that, given a threshold value, converts probabilities into binary predictions, then calculates and prints:
  - Detailed classification report (precision, recall, F1-score);
  - Confusion matrix;
  - Accuracy.

### 3. Plotting ROC and Precision-Recall curves

- **ROC curve:** shows the relationship between false positive rate (FPR) and true positive rate (TPR) for various thresholds.
- **Precision-Recall curve:** essential for evaluating models in imbalanced scenarios, highlighting the trade-off between precision and recall.
- We calculated and displayed the AUC-ROC and Average Precision (AP) metrics.

### 4. Practical testing of different thresholds

- We evaluated model performance for varying thresholds (0.3, 0.4, 0.5, 0.6, 0.7).
- This analysis allows you to choose the threshold that best suits your business objectives, such as prioritizing fewer false negatives (higher recall) or higher precision.

### 5. Final Evaluation with Adjusted Threshold

- We apply the chosen threshold (e.g., 0.4) to obtain the final prediction and evaluate its performance.

---

## Importance of Threshold Adjustment

- The default threshold of 0.5 may not be ideal, especially in cases of imbalanced classes.
- Adjusting the threshold can significantly improve model performance for specific objectives, such as minimizing false negatives in critical problems (e.g., churn, fraud, disease).
- The curves and metrics help you choose the most appropriate threshold.

---

> **Recommendation:** Always combine quantitative analysis with business knowledge to define the optimal threshold.
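
As a complement to testing a fixed list of thresholds, the candidate thresholds returned by `precision_recall_curve` can be scanned for the highest F1-score. The sketch below is a minimal illustration. The helper name `best_f1_threshold` and the four-sample example are assumptions for demonstration, and maximizing F1 is only one possible criterion.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Return the candidate threshold with the highest F1 and that F1 value."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The last precision/recall pair has no matching threshold, so drop it
    p, r = precision[:-1], recall[:-1]
    f1_scores = 2 * p * r / np.maximum(p + r, 1e-12)
    best = int(np.argmax(f1_scores))
    return thresholds[best], f1_scores[best]

# Tiny invented example
y_true_demo = np.array([0, 0, 1, 1])
y_prob_demo = np.array([0.10, 0.40, 0.35, 0.80])
thr, f1 = best_f1_threshold(y_true_demo, y_prob_demo)
print(f"best threshold = {thr:.2f}, F1 = {f1:.2f}")
# best threshold = 0.35, F1 = 0.80
```

A threshold picked this way on the test set is optimistic. In practice it would be safer to pick it on a validation split and confirm it on the test data.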

In [None]:
from sklearn.metrics import (
    roc_auc_score, roc_curve,
    precision_recall_curve, average_precision_score,
    classification_report, confusion_matrix
)

# Assumes grid_search is already trained and X_test_prep, y_test are defined

# 1. Get the predicted probabilities for the positive class
y_prob = grid_search.predict_proba(X_test_prep)[:, 1]

# 2. Define a function to evaluate the model at different thresholds
def avaliar_threshold(threshold, y_prob, y_true):
    y_pred = (y_prob >= threshold).astype(int)
    print(f"\n=== Evaluation for threshold = {threshold:.2f} ===")
    print("Classification Report:")
    print(classification_report(y_true, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    acc = (y_pred == y_true).mean()
    print(f"Accuracy: {acc:.4f}")
    return y_pred

# 3. Plot the ROC and Precision-Recall curves
fpr, tpr, thresholds_roc = roc_curve(y_test, y_prob)
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_prob)
auc_roc = roc_auc_score(y_test, y_prob)
ap_score = average_precision_score(y_test, y_prob)

plt.figure(figsize=(12,5))

plt.subplot(1,2,1)
plt.plot(fpr, tpr, label=f'AUC-ROC = {auc_roc:.2f}')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()

plt.subplot(1,2,2)
plt.plot(recall, precision, label=f'AP = {ap_score:.2f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()

plt.tight_layout()
plt.show()

# 4. Test different thresholds to find a better one
thresholds_teste = [0.3, 0.4, 0.5, 0.6, 0.7]
for th in thresholds_teste:
    avaliar_threshold(th, y_prob, y_test)

# You can choose the threshold that best balances recall and precision for your purpose
# For example, suppose 0.4 is a good value:
threshold_choiced = 0.4
y_pred_adjust = (y_prob >= threshold_choiced).astype(int)

print(f"\n=== Final evaluation with adjusted threshold = {threshold_choiced} ===")
print(classification_report(y_test, y_pred_adjust))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_adjust))


<a name="resume"></a>
## Summary of the Code


| Step | Description |
|-------|----------|
| 1 | Extract the **importances** of the features from the best model (`grid_search.best_estimator_`) |
| 2 | Use the `ColumnTransformer` to retrieve the **names of the already transformed variables** |
| 3 | Create a DataFrame `df_importance` with the names and importances |
| 4 | Select the **10 most important attributes** |
| 5 | Extract these columns from the transformed training/test set |
| 6 | Apply **SMOTE** only to the training with these 10 columns |
| 7 | Train a new **Random Forest** with the **best hyperparameters** already found |
| 8 | Evaluate performance on the test set |

<a name="-"></a>
## Function to Evaluate the Model with a Custom Threshold


This function facilitates the evaluation of binary classification models by adjusting the decision threshold to generate predictions and calculate performance metrics.

---

## Description of the `evaluate_threshold` function

```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def evaluate_threshold(model, X_test, y_test, threshold=0.5):
    # Obtain the predicted probabilities for the positive class
    probs = model.predict_proba(X_test)[:, 1]

    # Apply the custom threshold to generate binary predictions
    preds = (probs >= threshold).astype(int)

    # Print the evaluation metrics
    print(f"=== Evaluation for threshold = {threshold:.2f} ===")
    print("Classification Report:")
    print(classification_report(y_test, preds))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, preds))
    print("Accuracy:", accuracy_score(y_test, preds))
    print("\n")

    # Return the generated binary predictions
    return preds
```

<a name=""></a>
## Prediction and Evaluation with a Custom Threshold, Saving to CSV


This set of functions allows you to make predictions by adjusting the threshold to convert probabilities into binary classes, as well as evaluate prediction performance and save the results to a CSV file.

---

## `predizer_com_threshold` function

### Purpose
Make predictions with a custom threshold and return both binary predictions and predicted probabilities. Optionally, save the results to a CSV file for later analysis.

### Parameters
- `modelo`: trained model that has the `predict_proba` method.
- `X`: matrix or DataFrame with the preprocessed features for the prediction.
- `threshold` (float, default=0.5): cutoff value to transform probabilities into classes (0 or 1).
- `salvar_csv` (bool, default=False): indicates whether predictions and probabilities should be saved to a CSV file.
- `nome_arquivo` (str, default='predicoes.csv'): name of the CSV file where the results will be saved.

### Returns
- `preds`: Array containing the binary predictions (0 or 1).
- `probs`: Array with the probabilities of the positive class.

---

## `avaliar_predicoes` function

### Purpose
Evaluate and print performance metrics based on the binary predictions generated by the model, using the true values.

### Parameters
- `y_true`: Array with the true values of the classes.
- `preds`: Array containing the binary predictions of the model.

### Displayed Metrics
- Classification report (precision, recall, f1-score).
- Confusion matrix.
- Model accuracy.

---

## Usage Example

```python
# Define the desired threshold for prediction
threshold_choiced = 0.4

# Make predictions with the custom threshold and save them to CSV
y_pred_custom, y_prob_custom = predizer_com_threshold(
    modelo=grid_search.best_estimator_,
    X=X_test_prep,
    threshold=threshold_choiced,
    salvar_csv=True
)

# Evaluate the predictions against the true labels
avaliar_predicoes(y_test, y_pred_custom)

# Show the path of the saved CSV
import os
file_name = 'predicoes.csv'
print("File saved at:", os.path.abspath(file_name))
```

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def predizer_com_threshold(modelo, X, threshold=0.5, salvar_csv=False, nome_arquivo='predicoes.csv'):
    """
    Predict with a custom threshold and optionally save the results.

    Parameters:
    - modelo: trained model with a predict_proba method
    - X: preprocessed features for prediction
    - threshold: float, cutoff used to convert probabilities into classes
    - salvar_csv: bool, if True the predictions are saved to CSV
    - nome_arquivo: str, name of the CSV file used to save the predictions

    Returns:
    - preds: array with binary predictions
    - probs: array with the positive-class probabilities
    """
    # Get the probabilities of the positive class
    probs = modelo.predict_proba(X)[:, 1]

    # Apply the custom threshold
    preds = (probs >= threshold).astype(int)

    # Evaluation needs the true labels, which are not passed to this function;
    # call avaliar_predicoes with y_test afterwards.

    if salvar_csv:
        df_result = pd.DataFrame({'probabilidade': probs, 'predicao': preds})
        df_result.to_csv(nome_arquivo, index=False)
        print(f'Predictions saved to {nome_arquivo}')

    return preds, probs

def avaliar_predicoes(y_true, preds):
    """
    Evaluate the generated predictions and print a report.

    Parameters:
    - y_true: array with the true values
    - preds: array with binary predictions
    """
    print("Classification Report:")
    print(classification_report(y_true, preds))
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, preds))
    print("Accuracy:", accuracy_score(y_true, preds))

# Usage:

# Define the threshold
threshold_escolhido = 0.4

# Predict with the custom threshold
y_pred_custom, y_prob_custom = predizer_com_threshold(grid_search.best_estimator_, X_test_prep, threshold=threshold_escolhido, salvar_csv=True)

# Evaluate against the true labels
avaliar_predicoes(y_test, y_pred_custom)

import os

nome_arquivo = 'predicoes.csv'
print("File saved at:", os.path.abspath(nome_arquivo))



<a name="-"></a>
## Evaluating Metrics Across Multiple Thresholds, with Plots

This function allows you to evaluate the main classification metrics for different threshold values applied to the probabilities predicted by the model. The results are stored in a DataFrame, saved in a CSV file, and presented graphically to facilitate analysis of the impact of the threshold on the metrics.

---

## `evaluation_many_thresholds` function

### Purpose
Evaluate performance metrics (Precision, Recall, F1 Score, and Accuracy) of the model for a series of threshold values, allowing you to choose the best threshold for the problem at hand.

### Parameters
- `modelo`: trained model that has the `predict_proba` method.
- `X_test`: preprocessed test features.
- `y_test`: true labels corresponding to `X_test`.
- `thresholds`: array or list of float values representing the thresholds to be evaluated (default: values from 0.1 to 0.9 with a step of 0.1).

### Returns
- `df_results`: DataFrame containing the metrics calculated for each threshold.

---

## How it works
1. Obtains the predicted probabilities of the positive class for all test samples.
2. For each specified threshold:
   - Converts the probabilities into binary predictions.
   - Calculates the Precision, Recall, F1 Score, and Accuracy metrics.
   - Stores the results in a list of dictionaries.
3. Creates a DataFrame from the results.
4. Saves the DataFrame to a CSV file named `'avaliacao_thresholds.csv'`.
5. Plots the metrics as a function of the threshold for clear visualization of trends.
6. Returns the DataFrame with the results for later use.

---

## Visualization

The generated graph shows the Precision, Recall, F1 Score, and Accuracy curves as a function of the different thresholds tested, allowing you to visually identify which threshold offers the best balance between the metrics.

---

## Usage Example
```python
df_evaluation = evaluation_many_thresholds(grid_search.best_estimator_, X_test_prep, y_test)

print(df_evaluation)
```


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def evaluation_many_thresholds(modelo, X_test, y_test, thresholds=np.arange(0.1, 1.0, 0.1)):
    resultados = []

    probs = modelo.predict_proba(X_test)[:,1]

    for thr in thresholds:
        preds = (probs >= thr).astype(int)
        precision = precision_score(y_test, preds)
        recall = recall_score(y_test, preds)
        f1 = f1_score(y_test, preds)
        acc = accuracy_score(y_test, preds)
        resultados.append({
            'threshold': thr,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'accuracy': acc
        })

    df_results = pd.DataFrame(resultados)

    # Save the results to CSV
    df_results.to_csv('avaliacao_thresholds.csv', index=False)
    print("Results saved to 'avaliacao_thresholds.csv'")

    # Plot each metric as a function of the threshold
    plt.figure(figsize=(10,6))
    plt.plot(df_results['threshold'], df_results['precision'], marker='o', label='Precision')
    plt.plot(df_results['threshold'], df_results['recall'], marker='o', label='Recall')
    plt.plot(df_results['threshold'], df_results['f1_score'], marker='o', label='F1 Score')
    plt.plot(df_results['threshold'], df_results['accuracy'], marker='o', label='Accuracy')
    plt.xlabel('Threshold')
    plt.ylabel('Score')
    plt.title('Metrics vs. Threshold')
    plt.grid(True)
    plt.legend()
    plt.show()

    return df_results

# Usage example:
df_evaluation = evaluation_many_thresholds(grid_search.best_estimator_, X_test_prep, y_test)

print(df_evaluation)


<a name="-"></a>
## Threshold Analysis Report for Churn Prediction – TelecomX


###  Purpose


The goal is to find the best decision threshold to correctly classify customers who will churn. Since this is a churn prediction problem, we place greater emphasis on:

- High recall of class 1 (churners) → Identifying the majority of customers who will churn.
- Balanced F1-score → Compromise between detecting churn and avoiding false positives.
- Reasonable precision → Avoiding wasting resources on customers who would not churn.

---

### 📋 Summary Table of Metrics by Threshold
| Threshold | Precision | Recall | F1-score | Accuracy | Comment |
|-----------|-----------|--------|----------|----------|---------|
| **0.1**   | 0.33      | **0.98** | 0.50     | 0.48     | Highest recall, but too many false positives; not viable. |
| **0.2**   | 0.40      | **0.93** | 0.56     | 0.61     | Still many false positives. |
| **0.3**   | 0.47      | **0.86** | 0.61     | 0.71     | 🔸 Acceptable initial balance. |
| **0.4**   | 0.53      | 0.77     | **0.63** | 0.75     | ✅ Best F1-score, good recall. Good option for churn. |
| **0.5**   | 0.57      | 0.63     | 0.60     | 0.78     | Lower recall; favors accuracy. |
| **0.6**   | 0.64      | 0.48     | 0.54     | 0.79     | Loses a lot of recall. |
| **0.7**   | 0.73      | 0.29     | 0.42     | 0.78     | Low recall. |
| **0.8**   | 0.83      | 0.09     | 0.16     | 0.75     | Poor at detecting churn. |
| **0.9**   | 0.00      | 0.00     | 0.00     | 0.73     | Ignores class 1. |
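
As a quick sanity check on the table, the F1-score can be recomputed from each row's precision and recall, since F1 is their harmonic mean. The sketch below copies the rounded values from the table. The helper name `f1_from` is an assumption for this example, and because the inputs are already rounded, the recomputed values can differ from the reported ones by about 0.01.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total

# (threshold, precision, recall, reported F1) copied from the table
rows = [
    (0.1, 0.33, 0.98, 0.50),
    (0.2, 0.40, 0.93, 0.56),
    (0.3, 0.47, 0.86, 0.61),
    (0.4, 0.53, 0.77, 0.63),
    (0.5, 0.57, 0.63, 0.60),
    (0.6, 0.64, 0.48, 0.54),
    (0.7, 0.73, 0.29, 0.42),
    (0.8, 0.83, 0.09, 0.16),
    (0.9, 0.00, 0.00, 0.00),
]

for thr, p, r, reported in rows:
    print(f"{thr:.1f}: recomputed F1 = {f1_from(p, r):.3f} (reported {reported:.2f})")
```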

---

## 🟩 Conclusion: Best Threshold

The ideal threshold is 0.4, as it offers the best balance:

- Precision: 0.53
- Recall: 0.77 ✅
- F1-score: 0.63 (best among the tested thresholds)
- Accuracy: 0.75

🔍 Reason: This threshold captures most churners (recall of 0.77, i.e. high sensitivity) while keeping accuracy at 0.75. That makes it suitable for proactive decision-making by the Telecom X team.

---
## Recommendation

Use a threshold of 0.4 to classify customers as churners.

Optionally, monitor customers with a probability between 0.4 and 0.6 as a special attention group.
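
The recommendation above can be expressed as a small labeling helper. This is a minimal sketch: the function name `risk_segment`, the segment labels, and the choice to treat 0.4 as inclusive and 0.6 as exclusive are assumptions made for illustration.

```python
def risk_segment(prob, threshold=0.4, upper=0.6):
    """Label a churn probability using the 0.4 threshold and the 0.4-0.6 band."""
    if prob < threshold:
        return "not flagged"
    if prob < upper:
        return "special attention"
    return "high risk"

# Invented probabilities for demonstration
demo_probs = [0.15, 0.40, 0.55, 0.60, 0.90]
print([risk_segment(p) for p in demo_probs])
# ['not flagged', 'special attention', 'special attention', 'high risk', 'high risk']
```

Customers in both the "special attention" and "high risk" groups are classified as churners at the 0.4 threshold. The band only separates the borderline cases from the clear ones.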

---

<a name=""></a>
## How to Use the Code to Predict Churn

Here is a complete block ready to run in a Python notebook that:

1. Creates a new customer entry;

2. Loads the model (`model_rf.joblib`) and preprocessor (`preprocessor.joblib`) files;

3. Preprocesses the input;

4. Applies the model to predict churn probability and decision;

5. Displays the result based on an adjustable threshold.

Code block: churn prediction for a new customer

In [None]:
import pandas as pd
import joblib

# === 1. New client input ===
new_input = pd.DataFrame([{
    'id_client': '9999-TESTE',
    'gender': 'Female',
    'SeniorCitizen': 0,
    'Partner': 'Yes',
    'Dependents': 'No',
    'tenure': 5,
    'PhoneService': 'Yes',
    'MultipleLines': 'No',
    'InternetService': 'Fiber optic',
    'OnlineSecurity': 'No',
    'OnlineBackup': 'Yes',
    'DeviceProtection': 'Yes',
    'TechSupport': 'No',
    'StreamingTV': 'Yes',
    'StreamingMovies': 'Yes',
    'Contract': 'Month-to-month',
    'PaperlessBilling': 'Yes',
    'PaymentMethod': 'Electronic check',
    'monthly_value': 89.9,
    'total_value': 450.3
}])

# === 2. Load the saved preprocessor and model ===
preprocessor = joblib.load('preprocessor.joblib')
model_rf = joblib.load('model_rf.joblib')

# === 3. Apply preprocessing to the new input ===
new_input_prep = preprocessor.transform(new_input)

# === 4. Predict the churn probability ===
prob_churn = model_rf.predict_proba(new_input_prep)[0, 1]

# === 5. Define the threshold used to flag churn (e.g. 0.4) ===
threshold = 0.4
churn_predicted = int(prob_churn >= threshold)

# === 6. Show the final result ===
print(f"Churn probability: {prob_churn:.2f}")
print("Final result:", "⚠️ Churn" if churn_predicted == 1 else "No churn")
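
Before calling `preprocessor.transform`, it can help to confirm that the new input has the columns the preprocessor was fitted on. The sketch below is a minimal illustration. The helper `check_input_columns` and the short column list are assumptions for this example. In the notebook, the expected names could likely be read from `preprocessor.feature_names_in_`, assuming the preprocessor was fitted on a DataFrame.

```python
import pandas as pd

def check_input_columns(new_df, expected_columns):
    """Return (missing, extra) column names relative to the expected list."""
    expected = list(expected_columns)
    missing = [c for c in expected if c not in new_df.columns]
    extra = [c for c in new_df.columns if c not in expected]
    return missing, extra

# Hypothetical expected columns, shortened for the example
expected_demo = ["tenure", "Contract", "monthly_value"]
candidate = pd.DataFrame([{"tenure": 5, "Contract": "Month-to-month"}])

missing, extra = check_input_columns(candidate, expected_demo)
print("missing:", missing, "| extra:", extra)
# missing: ['monthly_value'] | extra: []
```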

In [52]:
# Cell 2: Test if Kaleido is working
import plotly.express as px
import plotly.io as pio

# Build a simple bar chart
fig = px.bar(x=["A", "B", "C"], y=[1, 3, 2])

# Save image
pio.write_image(fig, "/content/test_kaleido.png")
print("Kaleido is working, saved image in /content/test_kaleido.png")

ModuleNotFoundError: No module named 'plotly.validators.layout.margin'