# Step 7: Split the Data

## Preprocessing Pipeline Overview

This preprocessing pipeline outlines the steps necessary to prepare the Telco Customer Churn dataset for modeling. Each step addresses a specific aspect of data quality, transformation, or feature creation. Each step is covered in a separate Jupyter notebook.

**Step 1: Data Loading**: Loading the datasets into the workspace, ensuring all necessary files are correctly imported for analysis. This includes the Kaggle dataset and the IBM datasets.

**Step 2: Dataset Integration**: Combining relevant datasets into a single, unified dataset that will serve as the foundation for subsequent analysis.

**Step 3: Handling Missing Values**: Identifying and addressing missing values in the dataset to ensure data integrity. This step ensures no significant gaps hinder the analysis.

**Step 4: Data Type Conversion**: Converting data columns to appropriate data types to optimize memory usage and prepare for feature engineering. This step also ensures consistency across all columns.

**Step 5: Data Exploration**: Performing initial exploratory data analysis (EDA) to understand the dataset's structure and characteristics, and visualizing key features to gain insights into the data.

**Step 6: Feature Engineering**: Creating new features from the existing data to enhance model performance and capture additional insights. This includes transformations and derived features.

**Step 7: Dataset Splitting**: Splitting the dataset into training and testing subsets to prepare for model development and evaluation. This step ensures reproducibility and robust performance metrics.

**Step 8: Outlier Detection**: Identifying and addressing outliers in the dataset to ensure they do not negatively impact the analysis or models.

**Step 9: Clustering Customers**: Identifying the most common customer profiles via clustering.

## Methods to Split the Dataset

When working with a dataset of 7043 entries, it is crucial to split the data into training and testing subsets so that the model's performance can be evaluated on data it has not seen. Here are some common methods to split the dataset:

1. **Train-Test Split**:
    - This is the most straightforward method where the dataset is divided into two parts: training and testing sets.
    - Typically, 70-80% of the data is used for training, and the remaining 20-30% is used for testing.

2. **Stratified Shuffle Split**:
    - This method ensures that the training and testing sets have the same proportion of class labels as the original dataset.
    - It is particularly useful for imbalanced datasets.

3. **K-Fold Cross-Validation**:
    - This method splits the dataset into `k` equal-sized folds. The model is trained on `k-1` folds and tested on the remaining fold.
    - This process is repeated `k` times, with each fold used exactly once as the test set.

4. **Stratified K-Fold Cross-Validation**:
    - Similar to K-Fold Cross-Validation but ensures that each fold has the same proportion of class labels.

Each of these methods has its advantages and can be chosen based on the specific requirements of the analysis and the characteristics of the dataset.
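
The four methods above can be sketched side by side with scikit-learn. This is a minimal illustration on toy data, not the notebook's actual split: the arrays `X` and `y`, the 20% positive rate, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    StratifiedShuffleSplit,
    train_test_split,
)

# Toy data (an assumption for illustration): 100 rows, 20% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

# 1. Plain train-test split (80/20), no guarantee about class ratios.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

# 2. Stratified shuffle split: a random 80/20 split that preserves class ratios.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, test_idx = next(sss.split(X, y))
sss_test_rate = y[test_idx].mean()

# 3. K-fold: every row lands in the test fold exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=7)
kf_test_sizes = [len(test) for _, test in kf.split(X)]

# 4. Stratified K-fold: like K-fold, but each fold keeps the class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
skf_test_rates = [y[test].mean() for _, test in skf.split(X, y)]

print("plain split test size:", len(X_te))
print("stratified shuffle test positive rate:", sss_test_rate)
print("k-fold test sizes:", kf_test_sizes)
print("stratified k-fold test positive rates:", skf_test_rates)
```

For an imbalanced target such as churn, the stratified variants are usually the safer default, since a plain random split can shift the class ratio between the training and test sets.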


We are going to drop the 'Customer Status_Churned' and 'Churn Value' columns from the dataset.
Both are encodings of the 'Churn' column, so they are redundant with it. 'Churn' itself is kept for now because it serves as the target variable; it is separated from the features right before the split.


In [6]:
from sklearn.model_selection import StratifiedShuffleSplit
import pandas as pd

df = pd.read_csv("../2_data/telcocustomerchurn_featured.csv")
print(df.columns.tolist())
print(df.dtypes)

['Unnamed: 0', 'Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Number of Dependents', 'City', 'Zip Code', 'Latitude', 'Longitude', 'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue', 'Satisfaction Score', 'Churn Value', 'Churn Score', 'CLTV', 'LoyaltyID', 'Partner', 'Tenure', 'Monthly Charges', 'Churn', 'Country_United States', 'State_California', 'Quarter_Q3', 'Offer_Offer A', 'Offer_Offer B', 'Offer_Offer C', 'Offer_Offer D', 'Offer_Offer E', 'Internet Type_Cable', 'Internet Type_DSL', 'Internet Type_Fiber Optic', 'Cont

### Deleting Columns That Encode Information About Churn

We delete every column with "Churn" in its name, except the target column 'Churn' itself, because these features are very likely to introduce data leakage.
Data leakage occurs when the model is trained on information that would not be available at prediction time, which leads to overly optimistic performance estimates.
In this case, columns such as 'Churn Value', 'Churn Score', 'Customer Status_Churned', and the various 'Churn Category' and 'Churn Reason' columns
are either explicitly tied to the target variable (churn) or derived from it. Including them would give the model information
that it could not have in a real-world scenario, compromising its ability to generalize to new, unseen data.
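
To make the effect of leakage concrete, here is a small sketch on synthetic data. Everything in it is an assumption for illustration: the features are random noise, the column named `Churn Value` is simply a copy of the target, and `holdout_accuracy` is a hypothetical helper. It is not a measurement on the Telco data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic target and features (assumptions, not the Telco dataset).
y = pd.Series(rng.integers(0, 2, size=n), name="Churn")
X = pd.DataFrame({
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "Churn Value": y,  # leaky column: a direct copy of the target
})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)


def holdout_accuracy(columns):
    """Fit on the given columns and return accuracy on the held-out rows."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[columns], y_train)
    return model.score(X_test[columns], y_test)


acc_with_leak = holdout_accuracy(["noise_a", "noise_b", "Churn Value"])
acc_without_leak = holdout_accuracy(["noise_a", "noise_b"])

print(f"accuracy with leaky column:    {acc_with_leak:.3f}")
print(f"accuracy without leaky column: {acc_without_leak:.3f}")
```

The model with the leaky column looks excellent on the test set even though the remaining features carry no signal at all, which is exactly the kind of misleading estimate that removing the churn-derived columns is meant to prevent.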

In [7]:
churn_columns = [col for col in df.columns if 'Churn' in col]
print(churn_columns)

['Churn Value', 'Churn Score', 'Churn', 'Customer Status_Churned', 'Churn Category_Attitude', 'Churn Category_Competitor', 'Churn Category_Dissatisfaction', 'Churn Category_Other', 'Churn Category_Price', 'Churn Reason_Attitude of service provider', 'Churn Reason_Attitude of support person', 'Churn Reason_Competitor had better devices', 'Churn Reason_Competitor made better offer', 'Churn Reason_Competitor offered higher download speeds', 'Churn Reason_Competitor offered more data', 'Churn Reason_Deceased', "Churn Reason_Don't know", 'Churn Reason_Extra data charges', 'Churn Reason_Lack of affordable download/upload speed', 'Churn Reason_Lack of self-service on Website', 'Churn Reason_Limited range of services', 'Churn Reason_Long distance charges', 'Churn Reason_Moved', 'Churn Reason_Network reliability', 'Churn Reason_Poor expertise of online support', 'Churn Reason_Poor expertise of phone support', 'Churn Reason_Price too high', 'Churn Reason_Product dissatisfaction', 'Churn Reason_S

In [8]:
# Drop all columns containing "Churn" in their name except the "Churn" column
churn_columns_to_drop = [col for col in churn_columns if col != 'Churn']
df = df.drop(columns=churn_columns_to_drop)

# Display the remaining columns
print(df.columns.tolist())

['Unnamed: 0', 'Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Number of Dependents', 'City', 'Zip Code', 'Latitude', 'Longitude', 'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue', 'Satisfaction Score', 'CLTV', 'LoyaltyID', 'Partner', 'Tenure', 'Monthly Charges', 'Churn', 'Country_United States', 'State_California', 'Quarter_Q3', 'Offer_Offer A', 'Offer_Offer B', 'Offer_Offer C', 'Offer_Offer D', 'Offer_Offer E', 'Internet Type_Cable', 'Internet Type_DSL', 'Internet Type_Fiber Optic', 'Contract_Month-to-Month', 'Contrac

We will drop the `Unnamed: 0` column, an additional unique identifier that was added and is not needed for modeling. We will also drop the remaining `Customer Status` columns (`Customer Status_Joined` and `Customer Status_Stayed`), since they encode churn as well. Finally, because `LoyaltyID` is a numerical value encoding customer loyalty (that is, not churning), we will drop it too.

In [9]:
# Define the features and target variable
X = df.drop(columns=['Churn', 'Unnamed: 0', 'Customer Status_Joined', 'Customer Status_Stayed', 'LoyaltyID'])
y = df['Churn']

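# StratifiedShuffleSplit: five stratified 80/20 splits.
# Note: the loop below overwrites X_train/X_test on each pass,
# so only the last of the five splits is kept and saved.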
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=7)
print(sss)

for train_index, test_index in sss.split(X, y):
    print("train:", train_index, "test:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

print(f"X shape: {X.shape}")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y shape: {y.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

# Save the train and test splits to CSV files
X_train.to_csv("../2_data/X_train.csv", index=False)
X_test.to_csv("../2_data/X_test.csv", index=False)
y_train.to_csv("../2_data/y_train.csv", index=False)
y_test.to_csv("../2_data/y_test.csv", index=False)

StratifiedShuffleSplit(n_splits=5, random_state=7, test_size=0.2,
            train_size=None)
train: [4486 1855 3910 ... 4687  690 5658] test: [ 724  945 4710 ... 3730 1633 2496]
train: [ 937 6833 3567 ... 4257 4716 5657] test: [6171  505 5903 ... 6911 2257 5571]
train: [4793 2570 3406 ...  805 2103 1162] test: [  31 2128 2079 ... 2313 6356 5365]
train: [6799  180 3061 ... 3155 2428  793] test: [3083 4441  980 ... 4552 2903 1479]
train: [ 603 4034 6970 ... 1356 4927 1489] test: [5931 4868 2690 ... 3344 6206 5245]
X shape: (7043, 91)
X_train shape: (5634, 91)
X_test shape: (1409, 91)
y shape: (7043,)
y_train shape: (5634,)
y_test shape: (1409,)
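
One way to confirm that a stratified split did its job is to compare the class proportions of the full target with those of the training and test parts. The sketch below uses a toy stand-in for `X` and `y` (the 25% positive rate and the `row_id` column are assumptions for illustration); on the real data the same comparison could be run on the notebook's `y`, `y_train`, and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the notebook's features and target (assumption).
y = pd.Series([1] * 250 + [0] * 750, name="Churn")
X = pd.DataFrame({"row_id": range(len(y))})

# A single stratified 80/20 split with the same seed as above.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_index, test_index = next(sss.split(X, y))
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# Class proportions side by side: the three columns should agree closely.
balance = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(balance)

# The two index sets should not share any rows.
overlap = set(train_index) & set(test_index)
print("overlapping rows:", len(overlap))
```

Using `n_splits=1` here makes it explicit that a single split is produced, which matches what is ultimately written to the CSV files.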


In [10]:
# Print the columns of the dataframe
print(X_train.columns.tolist())

['Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Number of Dependents', 'City', 'Zip Code', 'Latitude', 'Longitude', 'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue', 'Satisfaction Score', 'CLTV', 'Partner', 'Tenure', 'Monthly Charges', 'Country_United States', 'State_California', 'Quarter_Q3', 'Offer_Offer A', 'Offer_Offer B', 'Offer_Offer C', 'Offer_Offer D', 'Offer_Offer E', 'Internet Type_Cable', 'Internet Type_DSL', 'Internet Type_Fiber Optic', 'Contract_Month-to-Month', 'Contract_One Year', 'Contract_Two Year', 'P


### Cross-Validation Function

By defining a cross-validation function, we can streamline the evaluation of different models and methods and keep our results reproducible and consistent. The function can be applied to various machine learning algorithms and dataset splits, so that their performance is assessed with the same cross-validation strategy.
We will refer to this cross-validation function in each method evaluation section.
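
A minimal sketch of what such a function could look like is shown below. The name `cross_validate_model`, its parameters, the choice of `StratifiedKFold`, and the synthetic demo data are all assumptions made for illustration; the function used in the later notebooks may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_validate_model(model, X, y, n_splits=5, scoring="accuracy", random_state=7):
    """Hypothetical helper: stratified k-fold CV with a fixed seed.

    Returns the mean and standard deviation of the per-fold scores.
    """
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
    return scores.mean(), scores.std()


# Synthetic, imbalanced demo data (assumption, not the Telco dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=7
)

mean_score, std_score = cross_validate_model(
    LogisticRegression(max_iter=1000), X_demo, y_demo
)
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

Fixing `random_state` inside the helper means every model is scored on the same folds, so differences in the reported scores reflect the models and not the split.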