![Cartoon of telecom customers](IMG_8811.png)


The telecommunications (telecom) sector in India is changing rapidly, with new telecom businesses entering the market and many customers switching between providers. "Churn" refers to customers or subscribers who stop using a company's services or products. Understanding which factors influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve service quality and customer satisfaction. As the data scientist on this project, you will explore how customer behavior and demographics in the Indian telecom sector relate to churn, using two datasets from four major telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL):

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                      |
|----------------------|--------------------------------------------------|
| `customer_id`          | Unique identifier for each customer.             |
| `telecom_partner`      | The telecom partner associated with the customer.|
| `gender`               | The gender of the customer.                      |
| `age`                  | The age of the customer.                         |
| `state`                | The Indian state in which the customer is located.|
| `city`                 | The city in which the customer is located.       |
| `pincode`              | The pincode of the customer's location.          |
| `registration_event` | When the customer registered with the telecom partner.|
| `num_dependents`      | The number of dependents (e.g., children) the customer has.|
| `estimated_salary`     | The customer's estimated salary.                 |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


In [94]:
# Import libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


In [95]:
telecom_demo = pd.read_csv('telecom_demographics.csv')

In [96]:
telecom_demo.head()

Unnamed: 0,customer_id,telecom_partner,gender,age,state,city,pincode,registration_event,num_dependents,estimated_salary
0,15169,Airtel,F,26,Himachal Pradesh,Delhi,667173,2020-03-16,4,85979
1,149207,Airtel,F,74,Uttarakhand,Hyderabad,313997,2022-01-16,0,69445
2,148119,Airtel,F,54,Jharkhand,Chennai,549925,2022-01-11,2,75949
3,187288,Reliance Jio,M,29,Bihar,Hyderabad,230636,2022-07-26,3,34272
4,14016,Vodafone,M,45,Nagaland,Bangalore,188036,2020-03-11,4,34157


In [97]:
telecom_usage = pd.read_csv('telecom_usage.csv')

In [98]:
telecom_usage.head()

Unnamed: 0,customer_id,calls_made,sms_sent,data_used,churn
0,15169,75,21,4532,1
1,149207,35,38,723,1
2,148119,70,47,4688,1
3,187288,95,32,10241,1
4,14016,66,23,5246,1
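
The `Unnamed: 0` header in both previews is what pandas shows when a CSV was saved with its index and then read back without `index_col`. If that is what happened here (an assumption, since the files themselves are not shown), the column would carry through the merge and end up among the numeric features. A minimal sketch, with a made-up in-memory CSV standing in for the real file:

```python
import io

import pandas as pd

# Hypothetical two-row CSV that was saved with its index (note the leading comma).
csv_text = ",customer_id,churn\n0,15169,1\n1,149207,1\n"

# Default read: the unnamed first column becomes a regular column.
as_read = pd.read_csv(io.StringIO(csv_text))
print(as_read.columns.tolist())

# index_col=0 treats the first column as the row index instead of a feature.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_index.columns.tolist())
```

Whether to apply this to `telecom_demographics.csv` and `telecom_usage.csv` depends on what the files actually contain.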


In [99]:
# 2. Merge the two DataFrames on the 'customer_id' column
churn_df = pd.merge(telecom_demo, telecom_usage, on='customer_id', how='inner')
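
An inner join silently drops customers that appear in only one table and duplicates rows if `customer_id` repeats. The sketch below shows one way to check for both with `pd.merge`'s `validate` and `indicator` parameters. The two small frames are made up for illustration and are not the telecom data.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented).
demo = pd.DataFrame({"customer_id": [1, 2, 3], "age": [26, 74, 54]})
usage = pd.DataFrame({"customer_id": [1, 2, 4], "churn": [1, 0, 1]})

# validate="one_to_one" raises MergeError if customer_id repeats on either side;
# indicator=True adds a _merge column recording which table each row came from.
checked = pd.merge(
    demo, usage, on="customer_id", how="outer",
    validate="one_to_one", indicator=True,
)

# Rows that an inner join would have dropped.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["customer_id", "_merge"]])
```

If `unmatched` is empty on the real data, the inner join above loses nothing.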

In [100]:
# 3. Compute the churn rate
# Assumes the churn indicator column is named 'churn' and holds booleans or 0/1
churn_rate = churn_df['churn'].mean()
print(f'Proportion of customers who churned: {churn_rate:.2%}')

Proportion of customers who churned: 20.05%


In [101]:
# 4. Identify the categorical variables
categorical_columns = churn_df.select_dtypes(include='object').columns.tolist()
print("Categorical variables:")
print(categorical_columns)

Categorical variables:
['telecom_partner', 'gender', 'state', 'city', 'registration_event']


In [102]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# 1. Define the target variable and the features
target = 'churn'
X = churn_df.drop(columns=[target])
y = churn_df[target]

# 2. Identify the categorical and numerical columns
categorical_cols = X.select_dtypes(include='object').columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Drop customer_id from the numerical columns: it is an identifier, not a feature
if 'customer_id' in numerical_cols:
    numerical_cols.remove('customer_id')

# 3. Build a transformer that handles each column type
# Note: scikit-learn 1.2 renamed OneHotEncoder's `sparse` argument to `sparse_output`
# (and 1.4 removed `sparse`); on older versions use sparse=False instead.
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_cols)
])

# 4. Apply the transformation to the data
features_scaled = preprocessor.fit_transform(X)
print(f'Final shape of the scaled features: {features_scaled.shape}')


Final shape of the scaled features: (6500, 1258)
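
Most of the 1258 columns probably come from one-hot encoding `registration_event`, which holds date strings and so contributes one column per distinct date. This is an inference from the list of categorical columns, not something the notebook measured. One alternative, sketched below on made-up rows, is to parse the dates and keep a few numeric parts instead. The names `registration_year` and `registration_month` are illustrative.

```python
import pandas as pd

# Made-up rows in the same date format as the preview above.
toy = pd.DataFrame(
    {"registration_event": ["2020-03-16", "2022-01-16", "2022-07-26"]}
)

# Parse the strings once, then derive compact numeric features.
dates = pd.to_datetime(toy["registration_event"])
toy["registration_year"] = dates.dt.year
toy["registration_month"] = dates.dt.month

# Drop the raw string column so it is no longer one-hot encoded.
toy = toy.drop(columns=["registration_event"])
print(toy)
```

Checking `X[categorical_cols].nunique()` on the real data would confirm which columns drive the width.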


In [103]:
from sklearn.model_selection import train_test_split

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, y, test_size=0.2, random_state=42
)

# Check the sizes
print(f'Size of X_train: {X_train.shape}')
print(f'Size of X_test: {X_test.shape}')
print(f'Size of y_train: {y_train.shape}')
print(f'Size of y_test: {y_test.shape}')


Size of X_train: (5200, 1258)
Size of X_test: (1300, 1258)
Size of y_train: (5200,)
Size of y_test: (1300,)
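
In the cells above, the preprocessor is fitted on all 6500 rows before the split, so the scaler's statistics and the encoder's categories also reflect the test rows. A common alternative is to split first and wrap preprocessing and model in a `Pipeline`, so that everything is fitted on the training rows only. The sketch below uses randomly generated toy data, not the telecom files. `stratify` keeps the churn share similar in both splits, and `handle_unknown="ignore"` avoids errors on categories that appear only in the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated toy data with roughly 20% churners (not the real dataset).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 75, size=n),
    "calls_made": rng.integers(0, 100, size=n),
    "telecom_partner": rng.choice(
        ["Airtel", "Reliance Jio", "Vodafone", "BSNL"], size=n
    ),
    "churn": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
})
X_toy = toy.drop(columns=["churn"])
y_toy = toy["churn"]

# Split the raw rows first; stratify preserves the churn proportion.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Preprocessing lives inside the pipeline, so fit() only sees training rows.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age", "calls_made"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["telecom_partner"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)
print(f"Test accuracy on toy data: {model.score(X_te, y_te):.3f}")
```

The toy labels are random, so the printed accuracy says nothing about the real problem; only the structure is meant to carry over.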


In [104]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Create and train a logistic regression model

logreg = LogisticRegression(random_state=42, max_iter=1000)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

In [105]:
# Create and train a Random Forest model

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

In [106]:
from sklearn.metrics import accuracy_score

# 1. Compute the accuracies
logreg_acc = accuracy_score(y_test, logreg_pred)
rf_acc = accuracy_score(y_test, rf_pred)

# Compare and store the name of the model with the higher accuracy
if logreg_acc > rf_acc:
    higher_accuracy = "LogisticRegression"
else:
    higher_accuracy = "RandomForest"

# For local debugging only
print(f"LogReg accuracy: {logreg_acc:.4f}")
print(f"RF accuracy:     {rf_acc:.4f}")
print(f"Model with the higher accuracy: {higher_accuracy}")


LogReg accuracy: 0.7838
RF accuracy:     0.7908
Model with the higher accuracy: RandomForest
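
Accuracy alone can mislead here. With about 20% churners, a model that always predicts "not churned" would score close to 80% if the test split mirrors the overall rate, which is roughly where both models land. The sketch below shows one way to compare against such a baseline and to look at per-class results. It uses hand-written labels, not the notebook's predictions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written labels: 8 non-churners and 2 churners (illustrative only).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Hypothetical model predictions: one false positive, one missed churner.
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

# Baseline that always predicts the most frequent class; it ignores the features.
X_dummy = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_pred = baseline.predict(X_dummy)

print("Baseline accuracy:", accuracy_score(y_true, baseline_pred))
print("Model accuracy:   ", accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, zero_division=0))
```

On the real data, the same calls with `y_test`, `logreg_pred` and `rf_pred` would show how many churners each model actually catches.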


In [107]:
print(len(logreg_pred), len(rf_pred), len(y_test))


1300 1300 1300


In [108]:
if accuracy_score(y_test, logreg_pred) > accuracy_score(y_test, rf_pred):
    higher_accuracy = "LogisticRegression"
else:
    higher_accuracy = "RandomForest"

print(higher_accuracy)


RandomForest
