## Step 1: Task 1 - Data Collection and Validation

The first step is loading the dataset and performing initial checks to validate that the data was acquired correctly and to identify any immediate data type or missing value issues.
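
As a minimal sketch of what acquisition validation can look like beyond reading the df.info() output, the snippet below checks a frame against a list of columns that later steps rely on. The helper name `missing_columns`, the toy frame, and the chosen column list are hypothetical and only illustrate the idea.

```python
import pandas as pd


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Toy frame standing in for a partially loaded dataset (assumption).
toy = pd.DataFrame({"customerID": ["A1"], "tenure": [1]})

required = ["customerID", "tenure", "TotalCharges", "Churn"]
absent = missing_columns(toy, required)
print(f"Missing columns: {absent}")
```

An empty list means every expected column arrived; anything else is worth resolving before preprocessing starts.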

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

print("--- 1. Data Collection: Load Dataset ---")

# Assuming the file is located at 'datasets/WA_Fn-UseC_-Telco-Customer-Churn.csv'
try:
    df = pd.read_csv('datasets/WA_Fn-UseC_-Telco-Customer-Churn.csv')
    print("✅ Dataset loaded successfully.")
except FileNotFoundError:
    print("❌ Error: CSV file not found. Please verify the file path.")
    df = pd.DataFrame() # Create empty DF to prevent downstream errors

# --- Data Acquisition Validation (Task 1) ---
if not df.empty:
    print(f"\nDataset Shape: {df.shape}")
    print("\nInitial Data Information:")
    df.info()

    # Display the first few rows to inspect the data
    print("\nFirst 5 Rows for Inspection:")
    print(df.head())

# Create a working copy for preprocessing
df_clean = df.copy()

print("\n✅ Step 1 Complete: Data loaded and copy created for cleaning.")

--- 1. Data Collection: Load Dataset ---
✅ Dataset loaded successfully.

Dataset Shape: (7043, 21)

Initial Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  Streamin

## 🛠️ Step 2: Task 2 - Missing Data Handling & Data Type Correction

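This step exposes the hidden missing values in TotalCharges and corrects its data type. The column contains blank strings, so pandas loads it as text and isnull() reports nothing missing until the values are converted to numbers. This covers the Missing Data Handling and Data Type Correction subtasks.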
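
To make the idea of a hidden missing value concrete, here is a small sketch on a toy Series (not the real column) showing how a blank string passes an isnull() check and only turns into NaN once `pd.to_numeric` is called with `errors="coerce"`.

```python
import pandas as pd

# Toy stand-in for the raw column: numbers stored as text plus one blank entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# A blank string is still a string, so nothing is reported as missing yet.
missing_before = int(raw.isnull().sum())

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
missing_after = int(converted.isnull().sum())

print(missing_before, missing_after, converted.dtype)
```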

In [4]:
# --- 2.1 Data Type Correction & Missing Data Handling ---

print("--- 2. Data Preprocessing: Missing Values & Type Correction ---")

# Convert 'TotalCharges' to numeric, forcing non-numeric values (the blank strings) to NaN
df_clean['TotalCharges'] = pd.to_numeric(df_clean['TotalCharges'], errors='coerce')

# Check for newly exposed NaN values
missing_count = df_clean['TotalCharges'].isnull().sum()
print(f"Number of hidden missing values found: {missing_count}")

# Imputation Strategy:
# Analyze the records with NaN TotalCharges; they all have a tenure of 0.
# A customer with 0 tenure should have 0 total charges.
if missing_count > 0:
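    # Assign back instead of inplace=True on a column selection, which newer pandas versions deprecate
    df_clean['TotalCharges'] = df_clean['TotalCharges'].fillna(0)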
    print("Imputation complete: NaN TotalCharges set to 0 as tenure is 0.")

# Verify type change and check all columns
print(f"TotalCharges is now type: {df_clean['TotalCharges'].dtype}")
print(f"Total NaN count across all columns: {df_clean.isnull().sum().sum()}")

print("\n✅ Step 2 Complete: TotalCharges column is fixed.")

--- 2. Data Preprocessing: Missing Values & Type Correction ---
Number of hidden missing values found: 11
Imputation complete: NaN TotalCharges set to 0 as tenure is 0.
TotalCharges is now type: float64
Total NaN count across all columns: 0

✅ Step 2 Complete: TotalCharges column is fixed.
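
The imputation above rests on the statement that every row with a missing TotalCharges has a tenure of 0. A hedged sketch of how that assumption could be checked before filling, using a hypothetical miniature frame rather than df_clean itself:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data right after the numeric conversion.
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 34],
    "TotalCharges": [np.nan, 350.5, np.nan, 2100.0],
})

missing = toy["TotalCharges"].isna()

# Fail loudly if any row with a missing total is NOT a brand-new customer.
assert (toy.loc[missing, "tenure"] == 0).all(), "Missing totals found for tenure > 0"

# Only fill with 0 once the assumption has been confirmed.
toy["TotalCharges"] = toy["TotalCharges"].fillna(0)
print(toy)
```

If the assertion ever fails on real data, filling with 0 would understate charges for established customers and a different strategy would be needed.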


## 🗑️ Step 3: Task 2 - Duplicate Removal and Standardization

This step addresses Duplicate Removal and Formatting & Standardization. We also drop the unique identifier column early.
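
One caveat worth illustrating: while a unique identifier is still in the frame, two rows can never compare as exact duplicates. The sketch below, on a toy frame with assumed values, compares a full-row duplicate check with one restricted to the feature columns.

```python
import pandas as pd

# Toy frame: the first two rows differ only by their identifier.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "tenure": [5, 5, 20],
    "MonthlyCharges": [70.0, 70.0, 45.5],
})

# With the ID included, every row is unique by construction.
full_dupes = int(toy.duplicated().sum())

# Comparing feature columns only surfaces rows that differ by ID alone.
feature_cols = toy.columns.drop("customerID").tolist()
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_dupes, feature_dupes)
```

Rows that match on features may still be distinct customers, so this is a diagnostic to review rather than an automatic reason to drop rows.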

In [6]:
# --- 3.1 Duplicate Removal ---

print("--- 3. Data Preprocessing: Deduplication & Standardization ---")
initial_rows = df_clean.shape[0]
df_clean.drop_duplicates(inplace=True)

print(f"Removed {initial_rows - df_clean.shape[0]} duplicate rows.")
print(f"Current dataset shape: {df_clean.shape}")

# --- 3.2 Formatting and Standardization (Categorical Labels) ---

# Standardize inconsistent service labels (e.g., 'No internet service' -> 'No')
columns_to_standardize = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup',
                          'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for col in columns_to_standardize:
    df_clean[col] = df_clean[col].replace({'No internet service': 'No', 'No phone service': 'No'})

print("Standardization of 'No service' labels complete.")

# --- 3.3 Outlier Treatment Note ---
print("\nNote on Outlier Treatment: Outliers in numerical features are retained as they represent valid, real-world customer segments.")

# --- 3.4 Drop Identifier ---
# The customerID is a unique identifier and has no predictive power.
df_model = df_clean.drop('customerID', axis=1)

print(f"Identifier column dropped. Modeling dataset created with shape: {df_model.shape}")
print("\n✅ Step 3 Complete: Duplicates checked, labels standardized, and ID dropped.")

--- 3. Data Preprocessing: Deduplication & Standardization ---
Removed 0 duplicate rows.
Current dataset shape: (7043, 21)
Standardization of 'No service' labels complete.

Note on Outlier Treatment: Outliers in numerical features are retained as they represent valid, real-world customer segments.
Identifier column dropped. Modeling dataset created with shape: (7043, 20)

✅ Step 3 Complete: Duplicates checked, labels standardized, and ID dropped.
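
The outlier note above is stated without a supporting check. One common way to back it up is the 1.5 × IQR rule, sketched here on toy values. The helper name `iqr_outlier_count` is hypothetical, and the rule is a convention assumed for illustration, not something the notebook itself applies.

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())


# Toy values with one obviously extreme entry.
toy = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 500.0])
count = iqr_outlier_count(toy)
print(f"Flagged values: {count}")
```

Running the same function over tenure, MonthlyCharges, and TotalCharges would show how many values the rule flags, which gives the decision to retain them something concrete to refer to.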


## 📊 Step 4: Task 2 - Categorical Data Encoding

Most machine learning models expect numerical input, so we now convert the remaining categorical (object) features into numbers: label encoding for two-valued columns and one-hot encoding for nominal columns with more than two categories.
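
As a small illustration of the two encodings on a toy frame (assumed values, not the real dataset), the sketch below shows that LabelEncoder assigns codes in sorted class order and that `drop_first=True` removes the first category of each one-hot group. Passing `dtype=int` is an optional variation that yields 0/1 integers instead of booleans.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# Label encoding: classes_ is sorted, so 'No' -> 0 and 'Yes' -> 1.
le = LabelEncoder()
toy["Partner"] = le.fit_transform(toy["Partner"])
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# One-hot encoding: 'Month-to-month' becomes the implicit baseline and is dropped.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)

print(mapping)
print(encoded)
```

Inspecting `classes_` after fitting is a quick way to confirm which label received which code before interpreting model coefficients.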

In [7]:
# --- 4.1 Label Encoding (Binary and Target) ---
print("--- 4. Data Preprocessing: Categorical Encoding ---")

# Identify binary columns ('Yes'/'No' or 'Male'/'Female')
binary_cols = [col for col in df_model.select_dtypes(include='object').columns if df_model[col].nunique() == 2]

# 1. Manually encode 'gender' for clear interpretability (Female=0, Male=1)
df_model['gender'] = df_model['gender'].map({'Female': 0, 'Male': 1})

# 2. Label Encode remaining binary columns (LabelEncoder sorts classes, so 'No'=0, 'Yes'=1)
le = LabelEncoder()
for col in binary_cols:
    if col != 'gender':
        df_model[col] = le.fit_transform(df_model[col])

print("Label encoding applied to all binary and target features.")
print(df_model[binary_cols].head(3)) # Check a few encoded columns

# --- 4.2 One-Hot Encoding (Nominal) ---
nominal_cols = ['InternetService', 'Contract', 'PaymentMethod']

# Use get_dummies and drop_first=True to avoid multicollinearity (the Dummy Variable Trap)
df_model = pd.get_dummies(df_model, columns=nominal_cols, drop_first=True)

print(f"\nOne-Hot Encoding applied. New dataset shape: {df_model.shape}")
print(df_model[['InternetService_Fiber optic', 'Contract_One year']].head(3))

print("\n✅ Step 4 Complete: All categorical data is now numerical.")

--- 4. Data Preprocessing: Categorical Encoding ---
Label encoding applied to all binary and target features.
   gender  Partner  Dependents  PhoneService  MultipleLines  OnlineSecurity  \
0       0        1           0             0              0               0   
1       1        0           0             1              0               1   
2       1        0           0             1              0               1   

   OnlineBackup  DeviceProtection  TechSupport  StreamingTV  StreamingMovies  \
0             1                 0            0            0                0   
1             0                 1            0            0                0   
2             1                 0            0            0                0   

   PaperlessBilling  Churn  
0                 1      0  
1                 0      0  
2                 1      1  

One-Hot Encoding applied. New dataset shape: (7043, 24)
   InternetService_Fiber optic  Contract_One year
0                        Fals

## ⚖️ Step 5: Task 2 - Feature Scaling / Normalization

The final transformation step is scaling the numerical features so that none dominate the model training process due to large magnitude differences.
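
One design choice to keep in mind for the modeling phase: fitting a scaler on the full dataset lets test rows influence the mean and standard deviation. A hedged sketch of the alternative, fitting on a training split only, is shown below on toy data. The split ratio and random seed are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "tenure": [1.0, 5.0, 12.0, 24.0, 36.0, 48.0, 60.0, 72.0],
    "MonthlyCharges": [20.0, 35.5, 50.0, 65.0, 70.5, 80.0, 95.0, 110.0],
})
cols = ["tenure", "MonthlyCharges"]

train, test = train_test_split(toy, test_size=0.25, random_state=0)

scaler = StandardScaler()

# Learn mean and standard deviation from the training rows only.
train_scaled = pd.DataFrame(
    scaler.fit_transform(train[cols]), columns=cols, index=train.index
)

# Reuse those statistics for the test rows; do not refit.
test_scaled = pd.DataFrame(
    scaler.transform(test[cols]), columns=cols, index=test.index
)

print(train_scaled.mean().round(6))
```

The training columns come out with mean 0 by construction, while the test columns generally do not, which is expected.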

In [8]:
# --- 5.1 Feature Scaling (StandardScaler) ---
print("--- 5. Data Preprocessing: Feature Scaling ---")

# Numerical columns to scale (SeniorCitizen is an indicator (0/1) and is typically left unscaled)
numerical_cols_to_scale = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Instantiate and apply StandardScaler (Z-score normalization)
scaler = StandardScaler()
df_model[numerical_cols_to_scale] = scaler.fit_transform(df_model[numerical_cols_to_scale])

print("Standard Scaling applied to numerical features.")

# Check final statistics to confirm scaling (Mean should be ~0, Std Dev ~1)
print("\nScaled Feature Descriptive Statistics:")
print(df_model[numerical_cols_to_scale].describe())

print("\n✅ Step 5 Complete: Numerical features scaled.")

--- 5. Data Preprocessing: Feature Scaling ---
Standard Scaling applied to numerical features.

Scaled Feature Descriptive Statistics:
             tenure  MonthlyCharges  TotalCharges
count  7.043000e+03    7.043000e+03  7.043000e+03
mean  -2.421273e-17   -6.406285e-17 -3.783239e-17
std    1.000071e+00    1.000071e+00  1.000071e+00
min   -1.318165e+00   -1.545860e+00 -1.005780e+00
25%   -9.516817e-01   -9.725399e-01 -8.299464e-01
50%   -1.372744e-01    1.857327e-01 -3.905282e-01
75%    9.214551e-01    8.338335e-01  6.648034e-01
max    1.613701e+00    1.794352e+00  2.825806e+00

✅ Step 5 Complete: Numerical features scaled.
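
The reported std of 1.000071 rather than exactly 1 is not a scaling error. StandardScaler divides by the population standard deviation (ddof=0), while describe() reports the sample standard deviation (ddof=1), so the two differ by a factor of sqrt(n / (n - 1)). The short check below reproduces the figure for n = 7043 and shows the same effect on a toy array.

```python
import numpy as np

# Row count taken from the describe() output above.
n = 7043
ratio = np.sqrt(n / (n - 1))
print(round(ratio, 6))

# Same effect on a toy array: z-scores built with ddof=0 ...
values = np.array([2.0, 4.0, 6.0, 8.0])
z = (values - values.mean()) / values.std(ddof=0)

# ... have population std 1, but a slightly larger sample std.
print(z.std(ddof=0), z.std(ddof=1))
```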


## ✔️ Step 6: Task 3 - Final Verification and Deliverable Summary

We finalize the process by verifying the dataset is fully ready for modeling and summarizing the deliverables.
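
As a sketch of a stricter verification than counting dtypes, the snippet below lists any column that pandas does not treat as numeric. It uses `pandas.api.types.is_numeric_dtype`, which accepts integer, float, and boolean columns. The toy frame and its "Leftover" column are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame mimicking a model-ready table with one column that slipped through.
toy = pd.DataFrame({
    "tenure": [0.5, -1.2],
    "Churn": [0, 1],
    "Contract_One year": [True, False],
    "Leftover": ["Yes", "No"],
})

# Collect the names of columns that still need encoding.
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(f"Columns still needing encoding: {non_numeric}")
```

Naming the offending columns makes a failed check actionable, since it points directly at the encoding step to revisit.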

In [10]:
# --- 6.1 Final Verification ---
print("--- 6. Final Verification and Deliverable Summary ---")

# Check all data types in the final DataFrame
print("\nFinal Data Types Check:")
print(df_model.dtypes.value_counts())

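# Ensure no columns are left as 'object'. select_dtypes matches on the dtype itself;
# a string lookup on value_counts() is keyed by dtype objects and can silently return 0.
# Note: the bool columns produced by get_dummies are treated as numerical here.
if df_model.select_dtypes(include='object').shape[1] == 0: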
    print("Success: All columns are now numerical types (int/float).")
else:
    print("Failure: Object columns still remain. Review encoding steps.")

print("\nHead of the Final Model-Ready DataFrame (df_model):")
print(df_model.head())

# --- Deliverable Summary ---
print("\n--- Phase 2 Deliverable Status ---")
print("1. Raw Data (df): Prepared")
print("2. Cleaned/Processed Dataset (df_model): Prepared (Fully Numerical, Scaled, Encoded)")
print("3. Source Code: Prepared (phased steps)")
print("4. Preprocessing Documentation: Documentation Markdown generated based on these steps.")

print("\n🎉 PHASE 2 COMPLETE: Data is fully preprocessed and ready for Phase 3 (Modeling).")

--- 6. Final Verification and Deliverable Summary ---

Final Data Types Check:
int32      12
bool        7
float64     3
int64       2
Name: count, dtype: int64
Success: All columns are now numerical types (int/float).

Head of the Final Model-Ready DataFrame (df_model):
   gender  SeniorCitizen  Partner  Dependents    tenure  PhoneService  \
0       0              0        1           0 -1.277445             0   
1       1              0        0           0  0.066327             1   
2       1              0        0           0 -1.236724             1   
3       1              0        0           0  0.514251             0   
4       0              0        0           0 -1.236724             1   

   MultipleLines  OnlineSecurity  OnlineBackup  DeviceProtection  ...  \
0              0               0             1                 0  ...   
1              0               1             0                 1  ...   
2              0               1             1                 0  ... 