Mini-Project!
Statistical Modeling and Interpretation of Results in Churn Analysis!

In [7]:
!pip install -q -U watermark

In [8]:
!pip install -q pandas==2.3.1 numpy==2.3.4 statsmodels==0.14.5 seaborn==0.13.2 matplotlib==3.10.7 plotly==6.3.1 ipython==9.6.0


In [9]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, Markdown

In [10]:
# Display floats with four decimal places in pandas output
pd.set_option('display.float_format', '{:.4f}'.format)

In [11]:
%reload_ext watermark
%watermark -a "Eduardo | Data Engineer | 2025-10-20 | Mini-Project: Statistical Modeling and Interpretation of Results in Churn Analysis"

Author: Eduardo | Data Engineer | 2025-10-20 | Mini-Project: Statistical Modeling and Interpretation of Results in Churn Analysis



In [12]:
%watermark --iversions

numpy      : 2.3.4
matplotlib : 3.10.7
statsmodels: 0.14.5
pandas     : 2.3.1
seaborn    : 0.13.2
IPython    : 9.6.0
plotly     : 6.3.1



display(Markdown("""
## 2. Business Problem Definition

### **2.1. Business Problem**

**Conecta Telecom**, a fictitious telecommunications company, is facing a churn rate above the industry average.

Management needs clear, data-driven answers to the following question:
**What are the main factors that lead our customers to cancel their services?**

- If we want to understand the relationships between variables → Statistical Modeling
- If we want to predict churn → Predictive Modeling

### **2.2. Project Objectives**

This project aims to use statistical modeling to analyze customer data and achieve the following objectives:

1. Identify Key Factors: Determine which variables (such as contract type, months of loyalty, monthly bill) have a statistically significant association with churn.
2. Quantify the Impact: Measure how strongly each factor raises or lowers the odds of churn.
3. Generate Recommendations: Translate the results of the statistical analysis into actionable business recommendations for creating customer retention strategies.

The model chosen for this analysis is Logistic Regression, because our goal is to understand how several explanatory variables relate to a binary outcome (churn or no churn).
"""))
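
To make the "Quantify the Impact" objective concrete: a logistic regression coefficient lives on the log-odds scale, so it is usually read as an odds ratio, exp(beta). The sketch below is only an illustration. The coefficient values are hypothetical placeholders rather than estimates from this project, and the helper names `odds_ratio` and `pct_change_in_odds` are introduced here for the example.

```python
import math

# Hypothetical coefficients on the log-odds scale, for illustration only.
# They are NOT estimates obtained from this project's data.
hypothetical_coefs = {
    "FidelidadeMeses": -0.05,  # per additional month of loyalty
    "FaturaMensal": 0.03,      # per additional currency unit on the bill
}


def odds_ratio(beta):
    """Convert a logistic regression coefficient into an odds ratio."""
    return math.exp(beta)


def pct_change_in_odds(beta):
    """Percent change in the odds of churn for a one-unit increase."""
    return (odds_ratio(beta) - 1.0) * 100.0


for name, beta in hypothetical_coefs.items():
    print(f"{name}: odds ratio = {odds_ratio(beta):.4f}, "
          f"change in odds = {pct_change_in_odds(beta):+.2f}%")
```

An odds ratio below 1 means the factor is associated with lower odds of churn, and one above 1 with higher odds. Note that this is a change in odds, not in probability.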

In [None]:
def dsa_gera_dados_churn(num_clientes=2000):
    """
    Generates a DataFrame of fictitious customer data from a telecommunications company.
    """

    # For reproducibility
    np.random.seed(42)

    # Variables
    fidelidade_meses = np.random.randint(1, 73, size=num_clientes)

    tipo_contrato_opts = ['Monthly', 'Annual', 'Two years']
    contrato_probs = [0.6, 0.25, 0.15]
    tipo_contrato = np.random.choice(tipo_contrato_opts, size=num_clientes, p=contrato_probs)

    servico_internet_opts = ['Fiber Optics', 'DSL', 'No']
    internet_probs = [0.55, 0.35, 0.10]
    servico_internet = np.random.choice(servico_internet_opts, size=num_clientes, p=internet_probs)

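    # Base bill per contract type. Each value is drawn only once, so all
    # customers with the same contract type share the same base amount.
    fatura_base = {
        'Monthly': np.random.normal(60, 20),
        'Annual': np.random.normal(70, 25),
        'Two years': np.random.normal(80, 25)
    }

    # Monthly bill: base amount + 0.2 per month of loyalty + individual noise,
    # clipped to the range [20, 120]
    fatura_mensal = [fatura_base[c] + fidelidade_meses[i] * 0.2 + np.random.normal(0, 5)
                     for i, c in enumerate(tipo_contrato)]
    fatura_mensal = np.clip(fatura_mensal, 20, 120)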

    # Churn log-odds (linear predictor), built up term by term
    prob_churn_log = -2.5
    prob_churn_log += -0.05 * fidelidade_meses  # Longer loyalty, lower chance of churn
    # Contract effect ('Annual' must match the spelling used in tipo_contrato_opts)
    prob_churn_log += np.array([3.0 if c == 'Monthly' else 1.5 if c == 'Annual' else 0.0 for c in tipo_contrato])
    prob_churn_log += np.array([0.8 if s == 'Fiber Optics' else -0.5 if s == 'DSL' else 0.0 for s in servico_internet])
    prob_churn_log += 0.03 * np.array(fatura_mensal)  # Higher bills, higher chance of churn

    # Logistic function: convert log-odds to probabilities, then sample churn (0/1)
    prob_churn = 1 / (1 + np.exp(-prob_churn_log))
    churn = np.random.binomial(1, prob_churn)

    # Construct DataFrame
    df = pd.DataFrame({
        'ID_Cliente': range(1, num_clientes + 1),
        'FidelidadeMeses': fidelidade_meses,
        'TipoContrato': tipo_contrato,
        'ServicoInternet': servico_internet,
        'FaturaMensal': fatura_mensal,
        'Churn': churn
    })

    return df
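
The generator above builds a linear score on the log-odds scale and maps it to a probability with the logistic function. As a rough check of that logic, the sketch below recomputes the churn probability for two illustrative customer profiles using the same offsets and slopes. It assumes the 1.5 offset is meant for the 'Annual' contract type. The helper names `churn_log_odds` and `sigmoid` and both profiles are hypothetical, introduced only for this example.

```python
import math

# Offsets copied from the generator; the 1.5 value is assumed to apply
# to the 'Annual' contract type.
CONTRATO_EFFECT = {"Monthly": 3.0, "Annual": 1.5, "Two years": 0.0}
INTERNET_EFFECT = {"Fiber Optics": 0.8, "DSL": -0.5, "No": 0.0}


def churn_log_odds(fidelidade_meses, tipo_contrato, servico_internet, fatura_mensal):
    """Linear predictor (log-odds of churn) for a single customer."""
    return (-2.5
            - 0.05 * fidelidade_meses
            + CONTRATO_EFFECT[tipo_contrato]
            + INTERNET_EFFECT[servico_internet]
            + 0.03 * fatura_mensal)


def sigmoid(z):
    """Logistic function: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Two hypothetical profiles with the same bill but different contracts
new_monthly = churn_log_odds(6, "Monthly", "Fiber Optics", 70.0)
loyal_two_years = churn_log_odds(60, "Two years", "DSL", 70.0)

print(f"New monthly/fiber customer: log-odds = {new_monthly:.2f}, p = {sigmoid(new_monthly):.4f}")
print(f"Loyal two-year/DSL customer: log-odds = {loyal_two_years:.2f}, p = {sigmoid(loyal_two_years):.4f}")
```

With these coefficients the first profile has a very high churn probability and the second a very low one, which suggests the simulated classes are strongly separated by contract type and loyalty.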
