<div style="background-color: #add8e6; padding: 10px; height: 70px; border-radius: 15px;">
    <div style="font-family: 'Georgia', serif; font-size: 20px; padding: 10px; text-align: right; position: absolute; right: 20px;">
        Mohammad Idrees Bhat <br>
        <span style="font-family: 'Arial', sans-serif;font-size: 12px; color: #0a0a0a;">Tech Skills Trainer | AI/ML Consultant</span>
    </div>
</div>


<div class="alert alert-block alert-warning">
    <b><font size="5"> 4. Hands-On Mini-Project: Model Refinement (Optional) </font> </b>
</div>


### **Project Title: Customer Churn Prediction Model Refinement Challenge**

**Description**

This project centers on **predicting customer churn**, a popular real-world problem in industries like **telecom**, **SaaS**, or **e-commerce**. The dataset contains information about customer demographics, subscription details, service usage, and customer support interactions. You will build a machine learning model to predict whether a customer will churn (i.e., stop using the service) based on these features.

This project is modern, widely applicable, and gives you insight into solving a business-critical problem.

**Instructions**

1. **Define the Problem**: Predict customer churn using the Telco Customer Churn dataset.  
2. **Data Preprocessing**: Handle missing values, encode categorical features, and scale numerical data.  
3. **Baseline Model**: Train a simple classifier with minimal preprocessing. 
4. **Evaluate Baseline**: Analyze accuracy, recall, and other key metrics.  
Then attempt to improve the model's metrics.  
5. **Feature Engineering**: Create new features, remove redundant ones, and analyze correlations.  
6. **Hyperparameter Tuning**: Optimize model parameters using GridSearchCV or similar tools.  
7. **Model Ensembling**: Use ensemble techniques like Gradient Boosting or Random Forest to enhance performance.  
8. **Final Evaluation**: Compare refined model metrics with baseline results and assess business value.  
9. **Presentation**: Summarize findings, improvements, and rationale for techniques used.  

<div style="background-color: lightblue; color: white; padding: 10px; text-align: center;">
    <h1>_________________________________END________________________________
</h1> </div>

<div style="background-color: lightgreen; color: black; padding: 30px;">
    <h4> Live Exercise Solutions
        
</h4> </div>

**Step 1: Define the Problem**

**Objective**: Predict whether a customer will churn (`Churn = 1`) or not (`Churn = 0`) using the Telco Customer Churn dataset.

- Dataset: Telco Customer Churn (`Telco-Customer-Churn.csv`).
- Target Variable: `Churn`.
- Features: Various customer attributes such as contract type, payment method, and internet service.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

Uncomment the lines below to download the dataset with `kagglehub`.

In [2]:
# pip install kagglehub # if kagglehub is not installed yet
#import kagglehub

# Download latest version
#path = kagglehub.dataset_download("blastchar/telco-customer-churn")

#print("Path to dataset files:", path)

- A link to the dataset : [Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn)

**Step 1 (continued): Load the Dataset**

In [3]:
# Load the dataset
# Adjust the path below to your own download location,
# and rename the CSV file (or edit the file name here) so the two match
import pandas as pd

data = pd.read_csv('c:\\Users\\devid\\.cache\\kagglehub\\datasets\\blastchar\\telco-customer-churn\\versions\\1\\Telco-Customer-Churn.csv')


In [17]:
data

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,0
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.50,0
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,1
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,0
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.50,0
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.90,0
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,0
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.60,1


**Step 2: Data Preprocessing**

In [4]:
# Minimal preprocessing: drop rows with NaN and encode the target variable
data = data.dropna()  # Drop rows that already contain NaN
data['Churn'] = data['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)  # Encode target: Yes -> 1, No -> 0

# Replace empty or whitespace-only strings with pd.NA.
# The affected rows are not dropped here, so these missing values must be handled later.
data.replace(r'^\s*$', pd.NA, regex=True, inplace=True)
# data.dropna(inplace=True)  # Uncomment to drop the rows that now contain pd.NA

A more structured way to handle preprocessing is a **data preprocessing pipeline** built from column transformers. Each group of columns gets its own sequence of steps (imputation, encoding, scaling), and the pipeline applies them before the data reaches the machine learning model.

It is used for:
- Automating the workflow: A preprocessing pipeline **automates repetitive data preparation tasks**, so the steps fitted on the training data are applied unchanged to the test data.
- Reproducibility: Using a pipeline guarantees that the same preprocessing steps are applied to any new dataset, reducing manual errors.
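
As a minimal sketch of that idea, the toy example below fits a small `ColumnTransformer` on training rows and then reuses the fitted statistics on test rows. The data frames and the name `toy_preprocessor` are made up for illustration; only the column names mirror the Telco dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the values are invented for illustration only
train = pd.DataFrame({
    "tenure": [1.0, 12.0, np.nan, 48.0],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})
test = pd.DataFrame({
    "tenure": [np.nan, 24.0],
    "Contract": ["One year", "Month-to-month"],
})

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric, ["tenure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

# Statistics (mean, scale, categories) are learned from the training rows only
train_t = toy_preprocessor.fit_transform(train)
# The same fitted steps are then applied, unchanged, to the test rows
test_t = toy_preprocessor.transform(test)

print(train_t.shape, test_t.shape)  # (4, 4) (2, 4)
```

The missing `tenure` value in the test frame is filled with the mean learned from the training rows, which is the consistency the pipeline is meant to provide.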



In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Define column categories
categorical_cols = ['Contract', 'PaymentMethod', 'InternetService', 'MultipleLines']
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 
               'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
               'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling']

# Define preprocessing for binary columns
binary_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('ordinal', OrdinalEncoder())  # Ordinal encode binary columns
])

# Define preprocessing for other categorical columns
categorical_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # One-hot encode
])

# Define preprocessing for numerical columns
numerical_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),           # Handle missing values
    ('scaler', StandardScaler())                           # Scale numerical features
])

# Combine into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_preprocessor, numerical_cols),
        ('cat', categorical_preprocessor, categorical_cols),
        ('binary', binary_preprocessor, binary_cols)  # Add binary columns processing
    ],
    remainder='passthrough'  # Leave any remaining columns unchanged
)

**Step 3: Baseline Model**

In [6]:
# Apply to the dataset
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Create the model pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

In [7]:
# Fit the pipeline to your data
# Note: Ensure 'customerID' is removed before fitting
X = data.drop(['Churn', 'customerID'], axis=1)
y = data['Churn']



In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model.fit(X_train, y_train)

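TypeError: float() argument must be a string or a real number, not 'NAType'

The fit fails because `TotalCharges` was read as text, and its blank entries are now `pd.NA`, which the numerical imputer cannot convert to a float. The next cells locate and fix the problem.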

In [9]:
print(X_train.isnull().sum())  # Check for missing values in the features
print(y_train.isnull().sum())  # Check for missing values in the target


gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        8
dtype: int64
0


In [10]:
# Convert 'TotalCharges' to numeric, coercing errors into NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Fill missing values with the column mean
# (assign the result back instead of calling fillna(..., inplace=True) on a column selection)
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].mean())

# Double-check to ensure there are no missing values left in 'TotalCharges'
print(data['TotalCharges'].isnull().sum())  # Should print 0 after filling

0

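
A small sketch of the conversion step on made-up values: `pd.to_numeric(..., errors='coerce')` turns whitespace-only strings into `NaN`, which an imputer can then fill. The sample series below is hypothetical and only mimics how `TotalCharges` arrives as text.

```python
import pandas as pd

# Hypothetical sample: numeric text with two whitespace-only entries
raw = pd.Series(["29.85", " ", "1889.5", " ", "108.15"], name="TotalCharges")

# Entries that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 2
```

One design note, offered as a suggestion rather than as the notebook's method: filling with the mean of the whole dataset also uses rows that later land in the test split. Leaving the `NaN` values in place and letting the pipeline's `SimpleImputer` fill them would compute the mean from the training split only.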


In [11]:
# Rebuild the feature matrix now that 'TotalCharges' is numeric
# 'customerID' is an identifier, not a feature, so it is dropped along with the target
X = data.drop(['Churn', 'customerID'], axis=1)

**Step 4: Evaluate Baseline**

In [14]:
# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Evaluate the model
from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.90      0.88      1539
           1       0.69      0.57      0.62       574

    accuracy                           0.81      2113
   macro avg       0.77      0.74      0.75      2113
weighted avg       0.81      0.81      0.81      2113



In [16]:
y_train.value_counts()

Churn
0    3635
1    1295
Name: count, dtype: int64

### **Let's now try to improve these results**

**Step 5: Feature Engineering**:

- **Interaction Features**: Create new features that capture relationships between existing ones. For example, the product of `tenure` and `MonthlyCharges` might reveal insights about customer retention that neither column shows alone.
- **Polynomial Features**: Add polynomial terms (squares and pairwise products) of the numerical variables. This can help a linear model capture non-linear relationships.

In [18]:
from sklearn.preprocessing import PolynomialFeatures

# Add polynomial features for numerical columns
# Create a pipeline to process numerical columns by adding polynomial features
polynomial_preprocessor = Pipeline(steps=[

    # Step 1: Imputer
    # This step handles missing values in the numerical data by replacing them with the mean value of the column.
    # It ensures that the polynomial feature generation doesn't fail due to missing values.
    ('imputer', SimpleImputer(strategy='mean')),

    # Step 2: PolynomialFeatures
    # This step generates polynomial features (up to degree 2 in this case) from the numerical data.
    # It allows the model to capture non-linear relationships by creating interaction terms and squared terms for each feature.
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),

    # Step 3: StandardScaler
    # This step standardizes the features by scaling them to have a mean of 0 and standard deviation of 1.
    # This is important for many machine learning algorithms, as it ensures that all features are on the same scale.
    ('scaler', StandardScaler())
])

In [19]:
# For demonstration, the workflow is repeated here with polynomial features added

# Load the dataset
# Adjust the path to your own download location and file name
import pandas as pd

data = pd.read_csv('c:\\Users\\devid\\.cache\\kagglehub\\datasets\\blastchar\\telco-customer-churn\\versions\\1\\Telco-Customer-Churn.csv')

# Minimal preprocessing: drop rows with NaN and encode the target variable
data = data.dropna()  # Drop rows that already contain NaN
data['Churn'] = data['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)  # Encode target variable

# Replace empty or whitespace-only strings with pd.NA (the rows are not dropped here)
data.replace(r'^\s*$', pd.NA, regex=True, inplace=True)
# data.dropna(inplace=True)  # Uncomment to drop the rows that now contain pd.NA

# Note: the train-test split further below reuses X and y from the earlier cells,
# where 'TotalCharges' was already converted to numeric






from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Define column categories
categorical_cols = ['Contract', 'PaymentMethod', 'InternetService', 'MultipleLines']
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 
               'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
               'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling']

# Define preprocessing for binary columns
binary_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('ordinal', OrdinalEncoder())  # Ordinal encode binary columns
])

# Define preprocessing for categorical columns
categorical_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # One-hot encode
])

# Define preprocessing for numerical columns with polynomial features
polynomial_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),           # Handle missing values
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),  # Generate polynomial features
    ('scaler', StandardScaler())                           # Scale numerical features
])

# Combine into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', polynomial_preprocessor, numerical_cols),
        ('cat', categorical_preprocessor, categorical_cols),
        ('binary', binary_preprocessor, binary_cols)
    ],
    remainder='passthrough'  # Leave any remaining columns unchanged
)

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Define the full pipeline with Logistic Regression
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))  # Logistic Regression Classifier
])

# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model_pipeline.fit(X_train, y_train)

# Evaluate the model
from sklearn.metrics import classification_report

y_pred = model_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.92      0.73      0.82      1539
           1       0.54      0.83      0.65       574

    accuracy                           0.76      2113
   macro avg       0.73      0.78      0.74      2113
weighted avg       0.82      0.76      0.77      2113



**Step 6: Hyperparameter Tuning**

Grid search or randomized search: tuning hyperparameters is a standard way to try to improve model performance. `GridSearchCV` evaluates every combination in a parameter grid with cross-validation and keeps the best one, here for the Logistic Regression model.
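
The sketch below shows the mechanics on synthetic data, so it runs without the Telco file. The names `X_demo`, `demo_pipeline`, `demo_grid`, and `demo_search` are illustrative, and `scoring="recall"` is an assumption made for this example: it is one reasonable choice when missing a churner costs more than a false alarm.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 25% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Keys use the "<step name>__<parameter>" convention to reach inside the pipeline
demo_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__class_weight": [None, "balanced"],
}

# scoring="recall" ranks the candidates by recall on the positive class
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="recall")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

With the default `refit=True`, `demo_search.best_estimator_` is already trained on all the data passed to `fit`, so it can be used for prediction right away.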

In [21]:
# create a parameter grid for logistic regression
param_grid = {
    'classifier__C': [0.1, 1, 10],  # Regularization strength
    'classifier__solver': ['liblinear', 'saga'],  # Solvers to try
    'classifier__penalty': ['l1', 'l2'],  # Regularization type
}

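from sklearn.model_selection import GridSearchCV

# 'model' is the baseline pipeline from Step 3 (no polynomial features, no class weights)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Report the best parameters; with the default refit=True,
# best_estimator_ is already refit on the whole training set
print("Best parameters found: ", grid_search.best_params_)
model = grid_search.best_estimator_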


Best parameters found:  {'classifier__C': 10, 'classifier__penalty': 'l1', 'classifier__solver': 'saga'}


In [22]:
# Use the best estimator from GridSearchCV directly.
# It was already refit on the training set (refit=True is the default), so no extra fit is needed.
best_model = grid_search.best_estimator_

# Evaluate the model
y_pred = best_model.predict(X_test)

# Display the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.90      0.88      1539
           1       0.69      0.57      0.62       574

    accuracy                           0.81      2113
   macro avg       0.77      0.74      0.75      2113
weighted avg       0.81      0.81      0.81      2113



The classification report summarizes the key metrics for a classification model: precision, recall, F1-score, and support for each class. Note that these numbers match the baseline report to two decimal places, so this grid search did not improve on the default settings.

| Class   | Precision | Recall | F1-Score | Support |
|---------|-----------|--------|----------|---------|
| **0**   | 0.85      | 0.90   | 0.88     | 1539    |
| **1**   | 0.69      | 0.57   | 0.62     | 574     |
| **Accuracy**   |       |        | 0.81     | 2113    |
| **Macro avg**  | 0.77      | 0.74   | 0.75     | 2113    |
| **Weighted avg** | 0.81      | 0.81   | 0.81     | 2113    |


- For class 0, precision is 0.85, meaning 85% of predicted 0s were actually 0s.
- For class 1, precision is 0.69, meaning 69% of predicted 1s were actually 1s.
- For class 1, recall is 0.57, meaning the model identifies only 57% of the customers who actually churn.

- Support is the number of true instances of each class in the test set.
  - Class 0 has 1539 instances, while class 1 has 574, which indicates class imbalance.

- Macro average computes each metric per class and then takes the unweighted mean, so both classes count equally regardless of size. Weighted average weights each class by its support.
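
The definitions above can be checked on a tiny hand-made example. The labels below are hypothetical and chosen so the counts are easy to verify by eye.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels: 8 non-churners (0) and 4 churners (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

precision, recall, f1, support = precision_recall_fscore_support(y_true, y_hat)

# Class 1: 2 true positives, 2 false positives, 2 false negatives
print(precision[1], recall[1])  # 0.5 0.5

# Macro average: plain mean over the classes, each class counts equally
macro_recall = recall.mean()
# Weighted average: mean weighted by each class's support
weighted_recall = np.average(recall, weights=support)
print(macro_recall, weighted_recall)
```

Because class 0 is larger and better predicted, the weighted recall comes out higher than the macro recall, which is the same pattern seen in the report above.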



Let's improve class balance and observe what happens.

By setting `class_weight='balanced'`, you're telling the model to adjust the weights inversely proportional to class frequencies. 

This means that the model will treat misclassifications of the minority class (class 1) as more costly and attempt to improve its performance on that class. 

However, this does not alter the support values in the output, which are based on the actual distribution of the classes in your dataset.
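
As a worked example of "inversely proportional to class frequencies", the sketch below computes the balanced weights for the training-set class counts shown earlier in this notebook (3635 non-churners, 1295 churners). The variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training-set class counts from the y_train.value_counts() output
n_stay, n_churn = 3635, 1295
y_counts = np.array([0] * n_stay + [1] * n_churn)

# "balanced" uses n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_counts)
manual = len(y_counts) / (2 * np.bincount(y_counts))

print(weights.round(3))  # roughly [0.678 1.903]
```

Each churner therefore counts almost three times as much as a non-churner in the loss, which explains why recall on class 1 rises while precision falls.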

In [23]:
best_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

param_grid = {
    'classifier__C': [0.1, 1, 10],  # Regularization strength
    'classifier__solver': ['liblinear', 'saga'],  # Solvers to try
    'classifier__penalty': ['l1', 'l2'],  # Regularization type
}

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(best_model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# Fit the grid search model
grid_search.fit(X_train, y_train)

# Report the best parameters; best_estimator_ is already refit on the training set
print("Best parameters found: ", grid_search.best_params_)
best_model2 = grid_search.best_estimator_

Best parameters found:  {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'classifier__solver': 'saga'}


In [24]:
# Use the best estimator from GridSearchCV directly.
# It was already refit on the training set (refit=True is the default), so no extra fit is needed.
best_model2 = grid_search.best_estimator_

# Evaluate the model
y_pred = best_model2.predict(X_test)

# Display the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.74      0.82      1539
           1       0.54      0.83      0.65       574

    accuracy                           0.76      2113
   macro avg       0.73      0.78      0.73      2113
weighted avg       0.82      0.76      0.77      2113



Let's handle class imbalance another way, with SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic minority-class samples by interpolating between existing ones:
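
To show the core idea without extra dependencies, here is a simplified NumPy sketch of SMOTE-style interpolation. The helper `smote_like_oversample` is hypothetical and written for illustration only; it is not the `imblearn` implementation, which handles more cases.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=42):
    """Simplified SMOTE-style oversampling (illustrative helper, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))      # random minority sample
        neighbour = rng.choice(neighbours[base])  # one of its k nearest neighbours
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Hypothetical 2-D minority class with four points
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
X_new = smote_like_oversample(X_min, n_new=6)
print(X_new.shape)  # (6, 2)
```

Every synthetic point lies on a line segment between two real minority samples. Resampling should be applied to the training data only, so that the test set keeps the real class distribution.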

In [25]:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Apply preprocessing to training data
X_train_transformed = preprocessor.fit_transform(X_train)

# Step 1: Apply SMOTE to balance the classes in the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_transformed, y_train)

In [26]:
# Class distribution in y_train (before SMOTE)
import pandas as pd

# y_train is already a pandas Series; wrapping it again is harmless
y_train_series = pd.Series(y_train)

# Get class distribution
class_distribution = y_train_series.value_counts()

print("Class distribution in y_train:")
print(class_distribution)

Class distribution in y_train:
Churn
0    3635
1    1295
Name: count, dtype: int64


In [27]:
# Class distribution in y_train_smote
import pandas as pd

# If y_train_smote is a NumPy array, convert it to a pandas Series for easier inspection
y_train_smote_series = pd.Series(y_train_smote)

# Get class distribution
class_distribution = y_train_smote_series.value_counts()

print("Class distribution in y_train_smote:")
print(class_distribution)

Class distribution in y_train_smote:
Churn
0    3635
1    3635
Name: count, dtype: int64


In [28]:
# Step 2: Define the model
# class_weight='balanced' assigns equal weights here, because SMOTE has already equalized the classes
model = LogisticRegression(max_iter=1000, class_weight='balanced')

# Step 3: Train the model on the SMOTE-balanced data
model.fit(X_train_smote, y_train_smote)

# Step 4: Predict on the transformed test set (the test set itself is never resampled)
X_test_transformed = preprocessor.transform(X_test)
y_pred_smote = model.predict(X_test_transformed)

# Step 5: Evaluate the model; support still reflects the original test-set distribution
print(classification_report(y_test, y_pred_smote))

              precision    recall  f1-score   support

           0       0.92      0.74      0.82      1539
           1       0.54      0.82      0.65       574

    accuracy                           0.76      2113
   macro avg       0.73      0.78      0.73      2113
weighted avg       0.81      0.76      0.77      2113



**Step 7: Ensemble Methods**

- **Random Forest**: An ensemble of decision trees, each trained on a bootstrap sample of the data. It can capture non-linear patterns and feature interactions that a linear model misses.
- **Gradient Boosting**: A boosting method such as `GradientBoostingClassifier` or XGBoost builds trees sequentially, each one focusing on the mistakes made by the previous ones, and often gives strong results on tabular data.

- **Random Forest Ensemble**

In [29]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report

# Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_smote, y_train_smote)  # Train with SMOTE data
y_pred_rf = rf_model.predict(X_test_transformed)  # Predict on the transformed test set

# Evaluate Random Forest
print("Random Forest Classifier Report:")
print(classification_report(y_test, y_pred_rf))

Random Forest Classifier Report:
              precision    recall  f1-score   support

           0       0.85      0.85      0.85      1539
           1       0.60      0.60      0.60       574

    accuracy                           0.78      2113
   macro avg       0.72      0.73      0.72      2113
weighted avg       0.78      0.78      0.78      2113



**The same Random Forest inside our preprocessing pipeline, trained on the original (non-SMOTE) training data**

In [None]:
# For demonstration, the workflow is repeated here before building the ensemble pipelines

# Load the dataset
# Adjust the path to your own download location and file name
import pandas as pd

data = pd.read_csv('c:\\Users\\devid\\.cache\\kagglehub\\datasets\\blastchar\\telco-customer-churn\\versions\\1\\Telco-Customer-Churn.csv')

# Minimal preprocessing: drop rows with NaN and encode the target variable
data = data.dropna()  # Drop rows that already contain NaN
data['Churn'] = data['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)  # Encode target variable

# Replace empty or whitespace-only strings with pd.NA (the rows are not dropped here)
data.replace(r'^\s*$', pd.NA, regex=True, inplace=True)
# data.dropna(inplace=True)  # Uncomment to drop the rows that now contain pd.NA

# Note: the train-test split further below reuses X and y from the earlier cells,
# where 'TotalCharges' was already converted to numeric






from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Define column categories
categorical_cols = ['Contract', 'PaymentMethod', 'InternetService', 'MultipleLines']
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 
               'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
               'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling']

# Define preprocessing for binary columns
binary_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('ordinal', OrdinalEncoder())  # Ordinal encode binary columns
])

# Define preprocessing for categorical columns
categorical_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # One-hot encode
])

# Define preprocessing for numerical columns with polynomial features
polynomial_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),           # Handle missing values
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),  # Generate polynomial features
    ('scaler', StandardScaler())                           # Scale numerical features
])

# Combine into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', polynomial_preprocessor, numerical_cols),
        ('cat', categorical_preprocessor, categorical_cols),
        ('binary', binary_preprocessor, binary_cols)
    ],
    remainder='passthrough'  # Leave any remaining columns unchanged
)


In [31]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest Pipeline
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Evaluate the model
from sklearn.metrics import classification_report

y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.91      0.86      1539
           1       0.65      0.47      0.54       574

    accuracy                           0.79      2113
   macro avg       0.74      0.69      0.70      2113
weighted avg       0.78      0.79      0.78      2113



- **Gradient Boosting Ensemble**

In [32]:
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train_smote, y_train_smote)  # Train with SMOTE data
y_pred_gb = gb_model.predict(X_test_transformed)  # Predict on the transformed test set

# Evaluate Gradient Boosting
print("\nGradient Boosting Classifier Report:")
print(classification_report(y_test, y_pred_gb))


Gradient Boosting Classifier Report:
              precision    recall  f1-score   support

           0       0.87      0.84      0.85      1539
           1       0.60      0.67      0.64       574

    accuracy                           0.79      2113
   macro avg       0.74      0.75      0.74      2113
weighted avg       0.80      0.79      0.79      2113



**The same Gradient Boosting model inside our preprocessing pipeline, trained on the original (non-SMOTE) training data**

In [33]:
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting Pipeline
gb_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

# Evaluate the model
from sklearn.metrics import classification_report

y_pred = gb_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.91      0.87      1539
           1       0.68      0.53      0.59       574

    accuracy                           0.80      2113
   macro avg       0.76      0.72      0.73      2113
weighted avg       0.79      0.80      0.80      2113
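
The instructions list a final evaluation step that compares the refined models with the baseline. One possible way to do that, sketched here on synthetic data so it runs on its own, is to collect the churn-class metrics of each model into a single table. All names (`X_demo`, `candidates`, `comparison`) are illustrative, and the numbers it prints are not the Telco results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

rows = []
for name, estimator in candidates.items():
    estimator.fit(Xd_train, yd_train)
    report = classification_report(yd_test, estimator.predict(Xd_test), output_dict=True)
    # Keep the positive-class metrics plus overall accuracy
    rows.append({
        "model": name,
        "precision_1": report["1"]["precision"],
        "recall_1": report["1"]["recall"],
        "f1_1": report["1"]["f1-score"],
        "accuracy": report["accuracy"],
    })

comparison = pd.DataFrame(rows).set_index("model").round(2)
print(comparison)
```

For churn, the recall and precision columns for class 1 are usually the ones to weigh against business cost: higher recall means fewer missed churners, while lower precision means more retention offers sent to customers who would have stayed.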



## **That's it. Repeat this process on another problem to strengthen these skills.**


<h2 style="background-color: #ffe4e1; color: #2f4f4f; padding: 10px; border-radius: 10px; width: 350px; text-align: center; float: right; margin: 20px 0;">
    Mohammad Idrees Bhat<br>
    <span style="font-size: 12px; color: #696969;">
        Tech Skills Trainer | AI/ML Consultant
    </span>
</h2>
