
# ***Task 2: End-to-End ML Pipeline with Scikit-learn Pipeline API***


## ***Installs***

In [1]:
!pip install scikit-learn pandas joblib



## ***Imports***

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
import joblib

## ***Dataset Fetching***
The Telco Customer Churn dataset is widely available online. This notebook loads it from Kaggle with the kagglehub library, which requires a Kaggle API token (kaggle.json) to be uploaded to the Colab environment first.

In [4]:
from google.colab import files
files.upload()  # Prompts you to upload your kaggle.json API token
# -p avoids an error if the directory already exists (e.g. when the cell is re-run)
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [5]:
import kagglehub
import pandas as pd

# Define the Kaggle dataset reference
dataset_ref = "blastchar/telco-customer-churn"

# Define the specific file to load from the dataset
file_path = "WA_Fn-UseC_-Telco-Customer-Churn.csv"

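# Load the dataset using kagglehub
# (dataset_load replaces load_dataset, which emits a deprecation warning in recent kagglehub releases)
df = kagglehub.dataset_load(
    kagglehub.KaggleDatasetAdapter.PANDAS,
    dataset_ref,
    file_path
)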

print("First 5 records:")
print(df.head())

First 5 records:
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies    

## ***Data Splitting + Data Preprocessing:***
Split the data into features (X) and the target variable (y), then into training and test sets. Splitting before fitting any preprocessing step prevents data leakage, because statistics such as imputation means and scaling parameters are then learned from the training set only.

Identify the categorical and numerical features. A ColumnTransformer applies different transformations to each feature type within the pipeline.

In [22]:
# Separate features (X) and target (y)
X = df.drop(['Churn', 'customerID'], axis=1)  # customerID is an identifier, not a predictive feature
y = (df['Churn'] == 'Yes').astype(int)  # Encode 'Yes'/'No' as 1/0

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
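
The stratify=y argument in the split above keeps the churn ratio the same in both subsets. A minimal sketch of that effect, using made-up labels (the 25% positive rate and the `*_demo` names are assumptions for illustration, not the Telco data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 1000 rows, 25% positives (not the Telco data).
y_demo = np.array([1] * 250 + [0] * 750)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both subsets keep roughly the original positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Without stratify, the two rates can drift apart by chance, which matters more as the classes become more imbalanced.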

In [23]:
# 'TotalCharges' is loaded as an object (string) column but holds numeric values.
# Convert it to numeric; entries that cannot be parsed become NaN and are imputed later in the pipeline.
# Work on explicit copies to avoid pandas SettingWithCopyWarning on the split frames.
X_train = X_train.copy()
X_test = X_test.copy()
X_train['TotalCharges'] = pd.to_numeric(X_train['TotalCharges'], errors='coerce')
X_test['TotalCharges'] = pd.to_numeric(X_test['TotalCharges'], errors='coerce')

# Define the feature lists explicitly. 'SeniorCitizen' is a 0/1 flag, so it is
# treated as categorical together with all remaining columns.
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = [col for col in X.columns if col not in numerical_features]
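
To make the effect of errors='coerce' concrete, here is a small sketch on a made-up Series. The sample values and the blank entry are assumptions standing in for the kind of unparseable strings the real column may contain:

```python
import pandas as pd

# Hypothetical sample mimicking TotalCharges stored as strings, with one blank entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```

The resulting NaN values are why the numerical branch of the pipeline needs an imputer before scaling.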

## ***Build and Train the Pipeline***

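The core of this task is building a scikit-learn Pipeline. It encapsulates all preprocessing and modeling steps in a single object, so exactly the same transformations are applied at training time and at prediction time, which makes the process reproducible and easier to deploy.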

In [24]:
# Create the preprocessing pipelines for numerical and categorical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Fill NaNs (e.g. unparseable TotalCharges values) with the column mean
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Keep any columns not specified
)
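
The following sketch shows what such a preprocessor produces on a tiny frame. The four-row frame, the "Five year" category, and the `toy_*` names are hypothetical; sparse_threshold=0 is set only so the result is a dense array that is easy to inspect:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical four-row frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "tenure": [1.0, 34.0, np.nan, 45.0],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), ["tenure"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ],
    sparse_threshold=0,  # always return a dense array (for readability here)
)

# One scaled numeric column plus one column per Contract category seen in fit.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)

# A category never seen during fit is encoded as all zeros instead of raising.
unseen = pd.DataFrame({"tenure": [10.0], "Contract": ["Five year"]})
unseen_row = toy_preprocessor.transform(unseen)
print(unseen_row)
```

This is the behaviour handle_unknown='ignore' buys at prediction time: new category values do not break the pipeline, they simply contribute no one-hot signal.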

### Logistic Regression

In [25]:
pipeline_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear')) # Using 'liblinear' for better performance on smaller datasets
])

### Random Forest

In [26]:
pipeline_rf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

## ***Hyperparameter Tuning with GridSearchCV***
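Define a parameter grid for each model and run GridSearchCV to find the best hyperparameters. Because the estimator is a Pipeline, each parameter is addressed as <step name>__<parameter>, for example classifier__C.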
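
As a small illustration of that naming scheme, the sketch below lists the addressable parameters of a two-step pipeline. The `demo_pipeline` object is hypothetical and simpler than the pipelines in this notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="liblinear")),
])

# Parameters of a step are exposed under '<step name>__<parameter>'.
classifier_params = sorted(
    name for name in demo_pipeline.get_params() if name.startswith("classifier__")
)
print(classifier_params)

# set_params accepts the same names; GridSearchCV relies on this mechanism
# to try each candidate value from the grid.
demo_pipeline.set_params(classifier__C=0.1)
current_C = demo_pipeline.named_steps["classifier"].C
print(current_C)
```

Calling get_params() on a pipeline is a quick way to check that a grid key is spelled correctly before starting a long search.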

In [27]:
# Define the parameter grids for each model
param_grid_lr = {
    'classifier__C': [0.01, 0.1, 1, 10, 100]
}

param_grid_rf = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30]
}

# Perform GridSearchCV for Logistic Regression
grid_search_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=5, n_jobs=-1, scoring='accuracy')
grid_search_lr.fit(X_train, y_train)
print(f"Best parameters for Logistic Regression: {grid_search_lr.best_params_}")
print(f"Best cross-validation accuracy for LR: {grid_search_lr.best_score_}")

# Perform GridSearchCV for Random Forest
grid_search_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, n_jobs=-1, scoring='accuracy')
grid_search_rf.fit(X_train, y_train)
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")
print(f"Best cross-validation accuracy for RF: {grid_search_rf.best_score_}")

Best parameters for Logistic Regression: {'classifier__C': 0.01}
Best cross-validation accuracy for LR: 0.80493316795403
Best parameters for Random Forest: {'classifier__max_depth': 10, 'classifier__n_estimators': 200}
Best cross-validation accuracy for RF: 0.8001399524981048
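
The scores above are cross-validation estimates computed on the training split; the held-out test split is not scored anywhere in this notebook. A hedged sketch of that extra check follows, run on synthetic data from make_classification so it is self-contained (the data, the `*_syn` names, and the smaller grid are assumptions, not the Telco setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the Telco dataset).
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

search = GridSearchCV(
    Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(solver="liblinear")),
    ]),
    {"classifier__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# After fitting, the search object predicts with the refit best estimator.
test_predictions = search.predict(X_te)
test_accuracy = accuracy_score(y_te, test_predictions)
print(f"CV accuracy:   {search.best_score_:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(y_te, test_predictions))
```

With the objects from this notebook, the equivalent call would be along the lines of accuracy_score(y_test, grid_search_lr.predict(X_test)). For an imbalanced target such as churn, the per-class precision and recall in the report are usually more informative than accuracy alone.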


## ***Exporting the pipeline***

In [28]:
# Select the model with the higher cross-validation accuracy
# (Logistic Regression: 0.805 vs. 0.800 for Random Forest)
best_pipeline = grid_search_lr.best_estimator_

# Export the complete pipeline using joblib
joblib.dump(best_pipeline, 'telco_churn_pipeline.joblib')
print("Best pipeline successfully exported as 'telco_churn_pipeline.joblib'")

Best pipeline successfully exported as 'telco_churn_pipeline.joblib'
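
One way to sanity-check an export is a round trip: dump the fitted pipeline, load it back, and confirm both objects predict identically. The sketch below does this with a small hypothetical pipeline on synthetic data, writing to a temporary directory rather than to the notebook's working directory:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small pipeline (assumptions for illustration).
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
fitted = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="liblinear")),
]).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    joblib.dump(fitted, path)      # serialize preprocessing + model together
    restored = joblib.load(path)   # load while the file still exists

# The restored pipeline should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(fitted.predict(X_syn), restored.predict(X_syn))
print(round_trip_ok)
```

Note that joblib files are pickles: they should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.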


## ***Testing***

In [29]:
loaded_pipeline = joblib.load('telco_churn_pipeline.joblib')

In [30]:
new_data = pd.DataFrame({
    'gender': ['Male'],
    'SeniorCitizen': [0],
    'Partner': ['Yes'],
    'Dependents': ['No'],
    'tenure': [50],
    'PhoneService': ['Yes'],
    'MultipleLines': ['Yes'],
    'InternetService': ['Fiber optic'],
    'OnlineSecurity': ['Yes'],
    'OnlineBackup': ['No'],
    'DeviceProtection': ['Yes'],
    'TechSupport': ['Yes'],
    'StreamingTV': ['Yes'],
    'StreamingMovies': ['Yes'],
    'Contract': ['Two year'],
    'PaperlessBilling': ['Yes'],
    'PaymentMethod': ['Credit card (automatic)'],
    'MonthlyCharges': [100.5],
    'TotalCharges': [5000.0]
})

In [31]:
# Use the loaded pipeline to predict churn for the new data
churn_prediction = loaded_pipeline.predict(new_data)

# You can also get probability scores
churn_probabilities = loaded_pipeline.predict_proba(new_data)

print(f"Prediction (0 = No Churn, 1 = Churn): {churn_prediction[0]}")
print(f"Probabilities (No Churn, Churn): {churn_probabilities[0]}")

Prediction (0 = No Churn, 1 = Churn): 0
Probabilities (No Churn, Churn): [0.87413074 0.12586926]
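
predict picks the class with the higher probability, which for two classes amounts to a 0.5 cut-off on the churn column of predict_proba. If a different cut-off is wanted, it can be applied to the probabilities directly. In this sketch the first row is the output printed above, while the other two rows and the 0.3 threshold are hypothetical:

```python
import numpy as np

# Probabilities in the (No Churn, Churn) layout returned by predict_proba.
# Row 0 is the result shown above; rows 1 and 2 are made up for illustration.
proba = np.array([
    [0.87413074, 0.12586926],
    [0.40, 0.60],
    [0.65, 0.35],
])

default_labels = proba.argmax(axis=1)          # what predict would return
threshold = 0.3                                # hypothetical, lower than 0.5
flagged = (proba[:, 1] >= threshold).astype(int)

print(default_labels)
print(flagged)
```

Lowering the threshold flags more customers as likely churners, trading precision for recall.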
