<p style="text-align:center"> 
    <a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/" target="_blank"> 
    <img src="../assets/logo.png" width="200" alt="Flavio Aguirre Logo"> 
    </a>
</p>

<h1 align="center"><font size="7"><strong>📉 ByeBye Predictor</strong></font></h1>
<br>
<hr>

## Telco Customer Data Preprocessing

At this stage, now that we know the structure of the datasets, we focus on cleaning the base dataset, "telco-customer-churn".

``Objective``: Prepare a clean, consistent dataset in a format suitable for use by machine learning algorithms. This allows us to maximize the performance of the predictive model.

In [129]:
# We import the necessary libraries for data manipulation.
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

# Suppress FutureWarning messages to keep the output readable
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

<br>

<hr>

``In the previous notebook we applied some transformations, such as formatting the "Churn" column, but did not save them. This notebook repeats those steps and saves the result.``

### ***Load the dataset***

In [130]:
# Load the dataset
print("\nLoading dataset...")
df_telco = pd.read_csv("../data/raw/telco-customer-churn.csv")
print("\nDataset loaded successfully.\n")

# Display the shape of the dataset
print(f"Dataset shape: {df_telco.shape}. \n")


Loading dataset...

Dataset loaded successfully.

Dataset shape: (7043, 21). 



As the EDA in notebook 02 showed, the dataset contains two kinds of columns: numeric and categorical.

In [131]:
# Confirm by grouping the columns by data type
categorical_columns = df_telco.select_dtypes(include=['object']).columns
numerical_columns = df_telco.select_dtypes(include=[np.number]).columns

print("\nData types within the dataset:\n")

print("Categorical columns:")
print(categorical_columns.tolist())
print("Total number of categorical columns:", len(categorical_columns))

print("\nNumerical columns:")
print(numerical_columns.tolist())
print("Total number of numerical columns:", len(numerical_columns))

# Detect binary columns that are encoded numerically
numeric_binary = [col for col in numerical_columns if df_telco[col].nunique() == 2]
continuous_numeric = [col for col in numerical_columns if col not in numeric_binary]

print("\nNumerically encoded binary columns:")
print(numeric_binary)
print("Total number of numerically encoded binary columns:", len(numeric_binary))
print("\nContinuous numeric columns:")
print(continuous_numeric)
print("Total number of numerically continuous columns:", len(continuous_numeric))


Data types within the dataset:

Categorical columns:
['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges', 'Churn']
Total number of categorical columns: 18

Numerical columns:
['SeniorCitizen', 'tenure', 'MonthlyCharges']
Total number of numerical columns: 3

Numerically encoded binary columns:
['SeniorCitizen']
Total number of numerically encoded binary columns: 1

Continuous numeric columns:
['tenure', 'MonthlyCharges']
Total number of numerically continuous columns: 2


<br>

### Data Formatting

***``'TotalCharges'``***

We already know that, although pandas reads ``'TotalCharges'`` as an ``object`` (string) column, it holds numerical values. Converting it lets us analyze the relationship between customer tenure, monthly charges, and total billed amounts.

In [132]:
# Convert 'TotalCharges' to numeric type (replacing errors with NaN)
df_telco['TotalCharges'] = pd.to_numeric(df_telco['TotalCharges'], errors='coerce')
print(f"\n'TotalCharges' was successfully converted to type: {df_telco['TotalCharges'].dtypes}\n")

# Select numeric columns after conversion
numerical_columns = df_telco.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns after conversion:")
print(numerical_columns)


'TotalCharges' was successfully converted to type: float64

Numeric columns after conversion:
['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
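
Before coercing, it can help to see which raw strings fail to parse, since ``errors='coerce'`` turns them into NaN without any message. The following is a minimal sketch on a toy Series; the values are made up for illustration and the real column may contain different non-numeric strings.

```python
import pandas as pd

# Toy stand-in for the raw 'TotalCharges' column (illustrative values, not the real data)
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# Coerce to numeric: anything that cannot be parsed becomes NaN
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to parse
failed_mask = converted.isna() & raw.notna()
print(raw[failed_mask].tolist())
print(f"{failed_mask.sum()} value(s) could not be converted")
```

Inspecting the failed entries first shows whether they are blanks, placeholders, or typos, which informs whether dropping or imputing them is more appropriate.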


***``'Churn'``***

We also encode the target variable, ``'Churn'``, as a binary integer.

In [133]:
# Convert 'Churn' to binary values: 'Yes' → 1, 'No' → 0
print(f"\nInitial format: {df_telco['Churn'].dtypes}")
df_telco['Churn'] = df_telco['Churn'].map({'Yes': 1, 'No': 0})
print("\n'Churn' successfully converted to numeric (0 = No, 1 = Yes)\n")
print(f"'Churn' final data type: {df_telco['Churn'].dtypes}\n")


Initial format: object

'Churn' successfully converted to numeric (0 = No, 1 = Yes)

'Churn' final data type: int64
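
One detail worth keeping in mind: ``Series.map`` with a dictionary returns NaN for any value that is not a key, so an unexpected label would silently become a missing value. A small sketch with toy labels (the malformed entry is invented for illustration) shows one way to check for this:

```python
import pandas as pd

# Toy labels; "yes " (lowercase, trailing space) is a deliberately malformed entry
churn = pd.Series(["Yes", "No", "No", "yes "])

mapping = {"Yes": 1, "No": 0}
encoded = churn.map(mapping)

# Values absent from the mapping come back as NaN
unmapped = churn[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)

# Hypothetical guard: only cast to int once every label was recognized
if not unmapped:
    encoded = encoded.astype("int64")
```

In this notebook the resulting dtype is ``int64``, which indicates that every label was matched.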



We add ``"Churn"`` to our ``"numerical_columns"`` list.

In [134]:
numerical_columns.append('Churn')
print(f"\nNumerical columns update: {numerical_columns}")


Numerical columns update: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']


Let's look at the summary statistics of the numerical columns:

In [135]:
df_telco[numerical_columns].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |
| TotalCharges | 7032.0 | 2283.300441 | 2266.771362 | 18.8 | 401.45 | 1397.475 | 3794.7375 | 8684.8 |
| Churn | 7043.0 | 0.26537 | 0.441561 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


<br>

### Missing Values


We check again how our DataFrame looks after the formatting changes:

In [136]:
print(df_telco.dtypes)

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                 int64
dtype: object


Now let's see whether this change exposed any hidden missing values:

In [137]:
print("\nNull values in the dataset:")
missing_values = df_telco.isnull().sum()
missing_values = missing_values[missing_values > 0]
if not missing_values.empty:
    print(missing_values)


Null values in the dataset:
TotalCharges    11
dtype: int64


We see ``11 missing values`` in the ``TotalCharges`` column.
Since they affect only 11 of 7,043 rows (about 0.16%), we drop those rows. If there were many, we would impute them instead.

In [138]:
df_telco = df_telco.dropna(subset=['TotalCharges'])
print("\nNull values after dropping rows with NaN in 'TotalCharges':")
missing_values = df_telco.isnull().sum()
print(missing_values)


Null values after dropping rows with NaN in 'TotalCharges':
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
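
If many values were missing, dropping rows would discard too much data. As a rough sketch of the imputation alternative mentioned above, the example below fills the gap with the column median. The toy numbers and the choice of the median are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing 'TotalCharges' value (illustrative numbers)
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": [29.85, 1889.50, np.nan, 1840.75],
})

# Median of the observed values (NaN is ignored by default)
median_total = toy["TotalCharges"].median()

# Fill the gap instead of dropping the row
imputed = toy.assign(TotalCharges=toy["TotalCharges"].fillna(median_total))
print(imputed)
```

Unlike ``dropna``, this keeps all four rows, at the cost of introducing an estimated value.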


***Why?***

Missing or non-numeric values in a continuous column prevent accurate model training.

With the missing values handled, we continue refining the dataset.

<br>

### Type Conversion and Cleanup of Irrelevant Columns

``Objective:``

Remove columns with no predictive value (``customerID``) and recode ``SeniorCitizen`` from 0/1 to 'No'/'Yes' so that it matches the other categorical variables. Column names are normalized later, just before saving.

In [139]:
# We eliminate columns without predictive power
df_telco = df_telco.drop(columns=['customerID'])

# SeniorCitizen: Convert 0 and 1 to 'No' and 'Yes' for consistency with other columns
df_telco['SeniorCitizen'] = df_telco['SeniorCitizen'].map({1: 'Yes', 0: 'No'})

# We check for categorical data types
categorical_cols = df_telco.select_dtypes(include='object').columns.tolist()
print("Categorical variables:", categorical_cols)

Categorical variables: ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']


***Why?***

* ``customerID`` is an identifier; it carries no predictive information.

* Making ``SeniorCitizen`` consistent with the other categorical variables lets us apply the same one-hot or label encoding to all of them.

### Encoding Categorical Variables

``Objective:``

Encode the categorical variables so that ML models can consume them. To do this, we use One-Hot Encoding.

In [140]:
# One-Hot Encoding (excluding target variable)
df_telco_encoded = pd.get_dummies(df_telco.drop(columns='Churn'), drop_first=True)
print("\nOne-Hot Encoding applied to categorical variables.\n")


One-Hot Encoding applied to categorical variables.



In [141]:
# Check the size of the processed dataset
print(f"\nProcessed dataset shape: {df_telco_encoded.shape}\n")


Processed dataset shape: (7032, 30)



In [142]:
# Check the first few rows of the encoded dataset
print("First few rows of the encoded dataset:")
df_telco_encoded.head()

First few rows of the encoded dataset:


| | tenure | MonthlyCharges | TotalCharges | gender_Male | SeniorCitizen_Yes | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | ... | StreamingTV_No internet service | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 29.85 | 29.85 | False | False | True | False | False | True | False | ... | False | False | False | False | False | False | True | False | True | False |
| 1 | 34 | 56.95 | 1889.5 | True | False | False | False | True | False | False | ... | False | False | False | False | True | False | False | False | False | True |
| 2 | 2 | 53.85 | 108.15 | True | False | False | False | True | False | False | ... | False | False | False | False | False | False | True | False | False | True |
| 3 | 45 | 42.3 | 1840.75 | True | False | False | False | False | True | False | ... | False | False | False | False | True | False | False | False | False | False |
| 4 | 2 | 70.7 | 151.65 | False | False | False | False | True | False | False | ... | False | False | False | False | False | False | True | False | True | False |


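We now have 30 feature columns. Of the original 21, ``customerID`` was dropped and ``Churn`` was set aside; the remaining 19 (3 numeric and 16 categorical) expanded to 30 through one-hot encoding.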

***Why?***

* Most machine learning algorithms cannot operate on strings directly, so categories must be transformed into numeric values.

* We use ``drop_first=True`` to avoid multicollinearity (the dummy variable trap).
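
To make the effect of ``drop_first=True`` concrete, here is a small sketch on a toy ``Contract`` column. The baseline label ``Month-to-month`` is an assumption; the notebook output only shows the two retained columns.

```python
import pandas as pd

# Toy column; 'Month-to-month' is an assumed label for the dropped baseline category
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category becomes the implicit baseline

print(full.columns.tolist())
print(reduced.columns.tolist())

# In the full encoding every row sums to 1, so any one column is implied by the others
print(full.sum(axis=1).tolist())
```

Because each row of the full encoding sums to 1, one column is redundant. Dropping it removes that exact linear dependence, and a row with all remaining dummies set to False denotes the baseline category.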

<br>

### Scaling of Numerical Variables

``Objective:``

Standardize data to test scale-sensitive algorithms (such as SVM, logistic regression).

In [143]:
scaler = StandardScaler()
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

df_telco_encoded[numeric_cols] = scaler.fit_transform(df_telco_encoded[numeric_cols])
print("\nStandardScaler applied to numeric columns.\n")


StandardScaler applied to numeric columns.



***Why?***

Some models (not all) are sensitive to feature scale and work best when the data are centered at 0 with variance 1. If we skip scaling, we can actually worsen the performance of those models.
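
The cell above fits the scaler on every row of the dataset. If the data is later split into training and test sets, a common practice is to fit the scaler on the training rows only, so that test statistics do not influence preprocessing. The sketch below illustrates that variant on synthetic data; the 80/20 split, the seeds, and the data itself are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a numeric feature matrix (not the real Telco data)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=20.0, size=(200, 3))

# Hypothetical 80/20 split
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 per column
```

The test rows are transformed with the training mean and standard deviation, so their scaled mean and variance will be close to, but generally not exactly, 0 and 1.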

<br>

### Saving the New Dataset

Once the dataset is cleaned, encoded, and unified, we save it so it can be used as input for the future ML model.

In [144]:
# First we add the Churn variable again
df_telco_encoded['Churn'] = df_telco['Churn'].values
print("\nChurn column added back to the encoded dataset.\n")


Churn column added back to the encoded dataset.
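
A note on ``.values``: it assigns by position and ignores the index. In this notebook both DataFrames share the same index (with gaps left by ``dropna``), so label-based assignment would also line up. The toy sketch below, with invented data, shows what can go wrong when the indexes differ and why a length check is a reasonable safeguard.

```python
import pandas as pd

# Toy frames: 'features' has a fresh index, 'target' still has gaps from dropped rows
features = pd.DataFrame({"tenure": [1, 34, 2]})               # index 0, 1, 2
target = pd.Series([0, 1, 1], index=[0, 2, 5], name="Churn")  # index 0, 2, 5

# Label-based assignment aligns on the index and leaves NaN where labels differ
by_label = features.assign(Churn=target)

# Position-based assignment ignores the index, so the row counts must match
assert len(features) == len(target)
by_position = features.assign(Churn=target.values)

print(by_label["Churn"].tolist())
print(by_position["Churn"].tolist())
```

Position-based assignment is only safe when both objects are in the same row order, which holds here because the encoded frame was derived directly from ``df_telco``.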



In [145]:
# We verify the final dataset
print(f"\nFinal dataset shape: {df_telco_encoded.shape}\n")
print("Final dataset columns:")
print(df_telco_encoded.columns.tolist())


Final dataset shape: (7032, 31)

Final dataset columns:
['tenure', 'MonthlyCharges', 'TotalCharges', 'gender_Male', 'SeniorCitizen_Yes', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes', 'MultipleLines_No phone service', 'MultipleLines_Yes', 'InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_No internet service', 'OnlineSecurity_Yes', 'OnlineBackup_No internet service', 'OnlineBackup_Yes', 'DeviceProtection_No internet service', 'DeviceProtection_Yes', 'TechSupport_No internet service', 'TechSupport_Yes', 'StreamingTV_No internet service', 'StreamingTV_Yes', 'StreamingMovies_No internet service', 'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes', 'PaymentMethod_Credit card (automatic)', 'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check', 'Churn']


In [146]:
# Let's look at the data types of the final dataset
print("\nData types in the final dataset:")
print(df_telco_encoded.dtypes)


Data types in the final dataset:
tenure                                   float64
MonthlyCharges                           float64
TotalCharges                             float64
gender_Male                                 bool
SeniorCitizen_Yes                           bool
Partner_Yes                                 bool
Dependents_Yes                              bool
PhoneService_Yes                            bool
MultipleLines_No phone service              bool
MultipleLines_Yes                           bool
InternetService_Fiber optic                 bool
InternetService_No                          bool
OnlineSecurity_No internet service          bool
OnlineSecurity_Yes                          bool
OnlineBackup_No internet service            bool
OnlineBackup_Yes                            bool
DeviceProtection_No internet service        bool
DeviceProtection_Yes                        bool
TechSupport_No internet service             bool
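TechSupport_Yes                             bool
StreamingTV_No internet service             bool
StreamingTV_Yes                             bool
StreamingMovies_No internet service         bool
StreamingMovies_Yes                         bool
Contract_One year                           bool
Contract_Two year                           bool
PaperlessBilling_Yes                        bool
PaymentMethod_Credit card (automatic)       bool
PaymentMethod_Electronic check              bool
PaymentMethod_Mailed check                  bool
Churn                                      int64
dtype: object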

Once our columns are confirmed, we standardize their names and then save the dataset for the next step.

In [None]:
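def clean_column_names(df):
    """Return a copy of df with lowercase, underscore-separated column names."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    df.columns = (
        df.columns
          .str.strip()
          .str.lower()
          .str.replace(' ', '_', regex=False)
          .str.replace('-', '_', regex=False)
          .str.replace(r'\W', '', regex=True)  # drop anything that is not a letter, digit, or underscore
    )
    return df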

# Clean column names
df_telco_encoded = clean_column_names(df_telco_encoded)
print("\nColumn names cleaned.\n")


Column names cleaned.



In [151]:
# Let's see the normalized column names
print("Normalized column names:")
print(df_telco_encoded.columns.tolist())

Normalized column names:
['tenure', 'monthlycharges', 'totalcharges', 'gender_male', 'seniorcitizen_yes', 'partner_yes', 'dependents_yes', 'phoneservice_yes', 'multiplelines_no_phone_service', 'multiplelines_yes', 'internetservice_fiber_optic', 'internetservice_no', 'onlinesecurity_no_internet_service', 'onlinesecurity_yes', 'onlinebackup_no_internet_service', 'onlinebackup_yes', 'deviceprotection_no_internet_service', 'deviceprotection_yes', 'techsupport_no_internet_service', 'techsupport_yes', 'streamingtv_no_internet_service', 'streamingtv_yes', 'streamingmovies_no_internet_service', 'streamingmovies_yes', 'contract_one_year', 'contract_two_year', 'paperlessbilling_yes', 'paymentmethod_credit_card_automatic', 'paymentmethod_electronic_check', 'paymentmethod_mailed_check', 'churn']
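
Stripping punctuation can, in principle, map two different names to the same cleaned name. The list above shows no such collision, but a quick check is cheap. The sketch below applies the same cleaning steps to illustrative names; the helper ``find_duplicate_columns`` is hypothetical and not part of the notebook.

```python
import pandas as pd

def find_duplicate_columns(columns):
    """Return cleaned names that occur more than once (hypothetical helper)."""
    cleaned = (
        pd.Index(columns)
          .str.strip()
          .str.lower()
          .str.replace(" ", "_", regex=False)
          .str.replace("-", "_", regex=False)
          .str.replace(r"\W", "", regex=True)
    )
    return cleaned[cleaned.duplicated()].unique().tolist()

# Illustrative names: the first two collapse to the same cleaned name
print(find_duplicate_columns(["Credit card (automatic)", "credit-card automatic", "Mailed check"]))

# Names taken from the encoded dataset: no collision expected
print(find_duplicate_columns(["Contract_One year", "Contract_Two year"]))
```

An empty list means every cleaned name is unique, so no column would be shadowed after renaming.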


Finally, we save our curated dataset.

In [152]:
# We save the processed dataset
df_telco_encoded.to_csv("../data/processed/telco-customer-churn-processed.csv", index=False)
print("Processed dataset saved successfully.\n")

Processed dataset saved successfully.
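
As an optional sanity check, the saved file can be read back and compared with the in-memory DataFrame. The sketch below does this with a toy frame and a temporary directory; the file name and the values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame mimicking the processed layout: scaled floats, boolean dummies, integer target
toy = pd.DataFrame({
    "tenure": [-1.28, 0.06, -1.24],
    "gender_male": [False, True, True],
    "churn": [0, 0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed-sample.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if shape, column names, dtypes, or values differ
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded.dtypes)
```

A check like this confirms that the boolean dummy columns and the integer target keep their types when the next notebook loads the CSV.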



<br>

<hr>

## Author

<a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/">**Flavio Aguirre**</a>
<br>
<a href="https://coursera.org/share/e27ae5af81b56f99a2aa85289b7cdd04">***Data Scientist***</a>