# Preprocessing of the Customer Churn Dataset


## Exploring and Preprocessing the Customer Churn Dataset

The dataset `Customer-Churn.csv` is located at `../2_data/Customer-Churn.csv`. Here is a structured approach to preprocess the data:

1. **Load the Dataset**:
    - Use pandas to load the CSV file into a DataFrame.

2. **Understand the Data**:
    - Display the first few rows of the dataset using `head()`.
    - Use `info()` to get a concise summary of the DataFrame, including the data types and non-null counts.
    - Use `describe()` to get statistical summaries of the numerical columns.

3. **Handle Missing Values**:
    - Identify columns with missing values using `isnull().sum()`.
    - Decide on strategies to handle missing values, such as filling with mean/median/mode or dropping rows/columns.

4. **Convert Data Types**:
    - Convert categorical columns to the 'category' data type.
    - Ensure numerical columns are in the correct format (e.g., integers, floats).

5. **Encode Categorical Variables**:
    - Use one-hot encoding or label encoding for categorical variables to convert them into numerical format.

6. **Feature Engineering**:
    - Create new features if necessary, such as aggregating or transforming existing features.
    - Normalize or standardize numerical features to bring them to a similar scale.

7. **Handle Outliers**:
    - Identify outliers using statistical methods or visualization techniques.
    - Decide on strategies to handle outliers, such as capping or removing them.

8. **Split the Data**:
    - Split the dataset into training and testing sets to evaluate the model performance.

By following these steps, we can ensure that the dataset is clean, well-structured, and ready for further analysis or modeling.


1. **Load the Dataset**:
    - Use pandas to load the CSV file into a DataFrame.

In [4]:
import pandas as pd

# Load the dataset
file_path = '../2_data/Customer-Churn.csv'
df = pd.read_csv(file_path)

2. **Understand the Data**:
    - Display the first few rows of the dataset using `head()`.
    - Use `info()` to get a concise summary of the DataFrame, including the data types and non-null counts.
    - Use `describe()` to get statistical summaries of the numerical columns.

In [6]:
# Display the first few rows of the dataset
print(df.head())

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

### Initial Exploration of the Dataset

Upon examining the first few rows of the dataset using the `head()` method, we observe the following:

- **customerID**: Unique identifier for each customer.
- **gender**: Gender of the customer (Male/Female).
- **SeniorCitizen**: Indicates if the customer is a senior citizen (0: No, 1: Yes).
- **Partner**: Indicates if the customer has a partner (Yes/No).
- **Dependents**: Indicates if the customer has dependents (Yes/No).
- **tenure**: Number of months the customer has stayed with the company.
- **PhoneService**: Indicates if the customer has phone service (Yes/No).
- **MultipleLines**: Indicates if the customer has multiple lines (Yes/No/No phone service).
- **InternetService**: Type of internet service (DSL/Fiber optic/No).
- **OnlineSecurity**: Indicates if the customer has online security (Yes/No/No internet service).
- **OnlineBackup**: Indicates if the customer has online backup (Yes/No/No internet service).
- **DeviceProtection**: Indicates if the customer has device protection (Yes/No/No internet service).
- **TechSupport**: Indicates if the customer has tech support (Yes/No/No internet service).
- **StreamingTV**: Indicates if the customer has streaming TV (Yes/No/No internet service).
- **StreamingMovies**: Indicates if the customer has streaming movies (Yes/No/No internet service).
- **Contract**: Type of contract (Month-to-month/One year/Two year).
- **PaperlessBilling**: Indicates if the customer has paperless billing (Yes/No).
- **PaymentMethod**: Payment method used by the customer (Electronic check/Mailed check/Bank transfer (automatic)/Credit card (automatic)).
- **MonthlyCharges**: Monthly charges incurred by the customer.
- **TotalCharges**: Total charges incurred by the customer.
- **Churn**: Indicates if the customer has churned (Yes/No).

### Noteworthy Observations

- **TotalCharges**: This column is of type `object` instead of `float64`, which suggests that there might be some non-numeric values or missing values that need to be handled.
- **SeniorCitizen**: This column is of type `int64` but represents a binary categorical variable, which might be better represented as a category.
- **Churn**: This is the target variable indicating whether the customer has churned, which will be crucial for any predictive modeling.

These observations will guide the preprocessing steps, such as handling missing values, converting data types, and encoding categorical variables.

In [7]:
# Get a concise summary of the DataFrame
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


### Key Points for EDA

**Data Types and Conversion**:

- **TotalCharges**: This column is of type `object` but likely contains numerical data. Investigate and convert it to a numeric type if appropriate, handling any non-numeric values.

**Categorical Variables**:

- **High Cardinality**: Columns like `customerID` have unique values for each entry and might not be useful for analysis.
- **Binary Categories**: Columns like `gender`, `Partner`, `Dependents`, `PhoneService`, `PaperlessBilling`, and `Churn` are binary and can be easily analyzed for distribution and correlation.

**Numerical Variables**:

- **SeniorCitizen, tenure, MonthlyCharges**: Analyze the distribution, central tendency, and dispersion. Look for outliers and patterns.
- **Correlation Analysis**: Check how these numerical variables correlate with each other and with the target variable `Churn`.

**Missing Values**:

- Although `info()` shows no missing values, ensure there are no hidden missing values (e.g., empty strings or placeholders) in columns like `TotalCharges`.

**Distribution Analysis**:

- **Histograms and Box Plots**: For numerical columns to understand their distribution and identify outliers.
- **Bar Plots**: For categorical columns to see the frequency distribution of each category.

**Relationships and Patterns**:

- **Pair Plots**: To visualize relationships between numerical variables.
- **Group By Analysis**: Group data by categorical variables (e.g., `Contract`, `PaymentMethod`) to see how they affect numerical variables and `Churn`.

**Target Variable Analysis**:

- **Churn**: Analyze the distribution of the target variable `Churn` to understand the class balance. This is crucial for model building.

**Feature Engineering**:

- **New Features**: Create new features that might capture more information, such as tenure groups or interaction terms between categorical variables.
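
A few of the checks listed above (class balance, group-by analysis) could be sketched as follows. The rows are made up for illustration and only the column names mirror the dataset; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
    "MonthlyCharges": [70.0, 20.0, 90.0, 50.0],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# Class balance of the target, as proportions.
class_balance = toy["Churn"].value_counts(normalize=True)

# Mean monthly charge within each churn class.
charges_by_churn = toy.groupby("Churn")["MonthlyCharges"].mean()

# Row-normalized cross-tabulation: share of churners per contract type.
churn_by_contract = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

print(class_balance, charges_by_churn, churn_by_contract, sep="\n\n")
```

The cross-tabulation is normalized by row so that each contract type can be read as a churn rate regardless of how many customers hold it.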

In [9]:
# Get statistical summaries of the numerical columns
print(df.describe())

       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000


- **SeniorCitizen**: Most customers are not senior citizens: the 25th, 50th, and 75th percentiles are all 0. Because the column is coded 0/1, its mean of 0.162 is the share of senior citizens, approximately 16.2%.
- **tenure**: Tenure ranges from 0 to 72 months with a standard deviation of 24.56, so the customer base mixes very new and long-standing customers. The median is 29 months, meaning half of the customers have been with the company for 29 months or less.
- **MonthlyCharges**: The average monthly charge is $64.76 with a standard deviation of $30.09, and charges range from $18.25 to $118.75. The median of $70.35 lies above the mean, so half of the customers pay $70.35 or less while a group of low-charge customers pulls the average down. This variability points to different service levels or packages.
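
The reading of the `SeniorCitizen` mean as a proportion relies on the 0/1 coding. A minimal sketch with made-up flags (not rows from the dataset) shows why the two quantities coincide:

```python
import pandas as pd

# Made-up 0/1 flags: one senior citizen among five customers.
senior = pd.Series([0, 0, 1, 0, 0])

# The mean of a 0/1 column is the sum of ones divided by the row count.
share_from_mean = senior.mean()

# The same share computed explicitly as a proportion.
share_from_count = (senior == 1).sum() / len(senior)

print(share_from_mean, share_from_count)
```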



3. **Handle Missing Values**:
    - Identify columns with missing values using `isnull().sum()`.
    - Decide on strategies to handle missing values, such as filling with mean/median/mode or dropping rows/columns.

In [10]:
# Identify columns with missing values
missing_values = df.isnull().sum()
print(missing_values)

# Since the dataset summary shows no missing values, let's check for any hidden missing values in 'TotalCharges'
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Now check again for missing values after conversion
missing_values_after_conversion = df.isnull().sum()
print(missing_values_after_conversion)

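# Handle missing values in 'TotalCharges' by filling with the median.
# Assigning the result back avoids the chained in-place call that stops working in pandas 3.0.
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())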

# Verify that there are no more missing values
final_missing_values = df.isnull().sum()
print(final_missing_values)

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
customerID          0
gender 



### Evaluation of Missing Values

The output of the missing values analysis reveals the following:

**Initial Check**:
    - Initially, there were no missing values detected in any of the columns, including `TotalCharges`.

**After Conversion**:
    - After converting the `TotalCharges` column to numeric, 11 missing values were identified. This indicates that there were non-numeric values in the `TotalCharges` column that were converted to `NaN`.

**Final Check**:
    - After handling the missing values in `TotalCharges` by filling them with the median, there are no more missing values in the dataset.

- `isnull()` only detects values that pandas already treats as missing, so the non-numeric entries in the `object`-typed `TotalCharges` column went unnoticed until the conversion turned them into `NaN`.
- The 11 affected rows are about 0.16% of the 7,043 customers, so median imputation barely changes the distribution of `TotalCharges`; dropping those rows would be a reasonable alternative.
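
To see what the hidden missing values look like before they are overwritten, the raw strings that fail conversion can be listed. This is a minimal sketch on a made-up column; it assumes the check runs on the raw `object` column, before `pd.to_numeric` replaces it, and the blank placeholder is an assumption for illustration.

```python
import pandas as pd

# Made-up raw column read as strings; the blank entry is a hidden missing value.
raw = pd.Series(["29.85", " ", "1889.5", "108.15"], name="TotalCharges")

converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present as text but became NaN during conversion.
failed = raw[converted.isna() & raw.notna()]

print(failed.tolist())
print(failed.index.tolist())
```

Inspecting `failed` before imputing makes it possible to tell apart blank placeholders, typos, and other non-numeric text, which may call for different handling.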

4. **Convert Data Types**:
    - Convert categorical columns to the 'category' data type.
    - Ensure numerical columns are in the correct format (e.g., integers, floats).

In [11]:
# Convert categorical columns to 'category' data type
categorical_columns = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 
                       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 
                       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']

for col in categorical_columns:
    df[col] = df[col].astype('category')

# Ensure numerical columns are in the correct format
df['SeniorCitizen'] = df['SeniorCitizen'].astype('int64')
df['tenure'] = df['tenure'].astype('int64')
df['MonthlyCharges'] = df['MonthlyCharges'].astype('float64')
df['TotalCharges'] = df['TotalCharges'].astype('float64')

# Verify the changes
print(df.dtypes)

customerID            object
gender              category
SeniorCitizen          int64
Partner             category
Dependents          category
tenure                 int64
PhoneService        category
MultipleLines       category
InternetService     category
OnlineSecurity      category
OnlineBackup        category
DeviceProtection    category
TechSupport         category
StreamingTV         category
StreamingMovies     category
Contract            category
PaperlessBilling    category
PaymentMethod       category
MonthlyCharges       float64
TotalCharges         float64
Churn               category
dtype: object


5. **Encode Categorical Variables**:
    - Use one-hot encoding or label encoding for categorical variables to convert them into numerical format.

In [12]:
# Use one-hot encoding for categorical variables
df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Display the first few rows of the encoded DataFrame
print(df_encoded.head())

   customerID  SeniorCitizen  tenure  MonthlyCharges  TotalCharges  \
0  7590-VHVEG              0       1           29.85         29.85   
1  5575-GNVDE              0      34           56.95       1889.50   
2  3668-QPYBK              0       2           53.85        108.15   
3  7795-CFOCW              0      45           42.30       1840.75   
4  9237-HQITU              0       2           70.70        151.65   

   gender_Male  Partner_Yes  Dependents_Yes  PhoneService_Yes  \
0        False         True           False             False   
1         True        False           False              True   
2         True        False           False              True   
3         True        False           False             False   
4        False        False           False              True   

   MultipleLines_No phone service  ...  StreamingTV_Yes  \
0                            True  ...            False   
1                           False  ...            False   
2          

### Conclusions from the Output Data

1. **Customer Tenure and Charges**:
    - The `tenure` of customers varies significantly, ranging from 1 to 45 months in the sample.
    - `MonthlyCharges` and `TotalCharges` also show a wide range, indicating diverse service usage and billing amounts among customers.

2. **Gender Distribution**:
    - The `gender_Male` column indicates the gender of the customers, with both male and female customers represented in the dataset.

3. **Partner and Dependents**:
    - The `Partner_Yes` and `Dependents_Yes` columns show whether customers have partners or dependents. In the sample, some customers have partners, but none have dependents.

4. **Phone and Internet Services**:
    - The `PhoneService_Yes` and `InternetService_Fiber optic` columns indicate whether customers have phone and fiber optic internet services. There is a mix of customers with and without these services.

5. **Streaming Services**:
    - The `StreamingTV_Yes` and `StreamingMovies_Yes` columns show whether customers subscribe to streaming TV and movies services. In the sample, none of the customers have these services.

6. **Contract Types**:
    - The `Contract_One year` and `Contract_Two year` columns indicate the type of contract customers have. Month-to-month is the category dropped by `drop_first=True`, so a month-to-month customer has `False` in both columns.

7. **Billing and Payment Methods**:
    - The `PaperlessBilling_Yes` column shows whether customers use paperless billing. Some customers use paperless billing, while others do not.
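    - The `PaymentMethod_*` columns indicate the payment methods used by customers: credit card (automatic), electronic check, and mailed check. Bank transfer (automatic) is the dropped baseline category, encoded as `False` in all three columns.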

8. **Churn**:
    - The `Churn_Yes` column indicates whether customers have churned. In the sample, there are both churned and non-churned customers.
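
The effect of `drop_first=True` on a three-level column can be sketched on a made-up frame. The first category in sorted order is dropped and becomes the implicit baseline:

```python
import pandas as pd

# Made-up frame with one row per contract type.
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})

# drop_first=True removes the dummy column of the first category in sorted order.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The month-to-month row carries no positive indicator in either remaining column, which is how the baseline is represented after encoding.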

### Noteworthy Observations

- **Diverse Customer Profiles**: The dataset includes a diverse range of customer profiles in terms of tenure, charges, services subscribed, and contract types.
- **Churn Analysis**: The presence of both churned and non-churned customers allows for analysis of factors contributing to customer churn.
- **Service Usage**: The variability in service usage (e.g., phone, internet, streaming) can provide insights into customer preferences and potential areas for service improvement.
- **Billing and Payment Preferences**: Understanding billing and payment preferences can help tailor customer service and billing processes to improve customer satisfaction and retention.


6. **Feature Engineering**:
    - Create new features if necessary, such as aggregating or transforming existing features.
    - Normalize or standardize numerical features to bring them to a similar scale.

In [14]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Create a new feature for tenure groups.
# include_lowest=True keeps tenure = 0 in the first bin instead of producing NaN.
df['tenure_group'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 60, np.inf],
                            labels=['0-12', '12-24', '24-48', '48-60', '60+'], include_lowest=True)

# Create a new feature for total charges per month
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')  # Ensure TotalCharges is numeric
df['TotalChargesPerMonth'] = df['TotalCharges'] / df['tenure']
# Division by tenure = 0 yields inf: replace it with NaN, then fill with 0.
# Assigning the results back avoids the chained in-place calls that stop working in pandas 3.0.
df['TotalChargesPerMonth'] = df['TotalChargesPerMonth'].replace([np.inf, -np.inf], np.nan)
df['TotalChargesPerMonth'] = df['TotalChargesPerMonth'].fillna(0)

# Convert SeniorCitizen to a categorical feature
df['SeniorCitizen'] = df['SeniorCitizen'].astype('category')

# Standardize numerical features (zero mean, unit variance)
scaler = StandardScaler()
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges', 'TotalChargesPerMonth']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# Verify the changes
print(df.head())

   customerID  gender SeniorCitizen Partner Dependents    tenure PhoneService  \
0  7590-VHVEG  Female             0     Yes         No -1.277445           No   
1  5575-GNVDE    Male             0      No         No  0.066327          Yes   
2  3668-QPYBK    Male             0      No         No -1.236724          Yes   
3  7795-CFOCW    Male             0      No         No  0.514251           No   
4  9237-HQITU  Female             0      No         No -1.236724          Yes   

      MultipleLines InternetService OnlineSecurity  ... StreamingTV  \
0  No phone service             DSL             No  ...          No   
1                No             DSL            Yes  ...          No   
2                No             DSL            Yes  ...          No   
3  No phone service             DSL            Yes  ...          No   
4                No     Fiber optic             No  ...          No   

  StreamingMovies        Contract PaperlessBilling              PaymentMethod  \
0    



### Additional Features Created

1. **tenure_group**:
    - **Description**: This feature categorizes the `tenure` of customers into groups: '0-12', '12-24', '24-48', '48-60', and '60+' months.
    - **Reason for Creation**: Grouping the tenure into categories helps in understanding the distribution of customer loyalty over time. It can reveal patterns in customer retention and identify which tenure groups are more likely to churn.

2. **TotalChargesPerMonth**:
    - **Description**: This feature is calculated by dividing the `TotalCharges` by the `tenure`, representing the average charges per month for each customer.
    - **Reason for Creation**: This feature normalizes the total charges over the tenure period, providing a clearer picture of the monthly expenditure of customers. It helps in identifying customers who might be paying more or less on average per month, which can be crucial for understanding customer satisfaction and predicting churn.
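
Both features have an edge case at `tenure = 0`, which the `describe()` output shows is present in the data. The sketch below uses made-up values to illustrate how `pd.cut` treats the lowest bin edge and one way to guard the division; the `where`-based guard is an alternative shown for illustration, not necessarily the approach used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up values covering the bin edges, including tenure = 0.
toy = pd.DataFrame({
    "tenure": [0, 1, 12, 13, 72],
    "TotalCharges": [0.0, 29.85, 600.0, 650.0, 7200.0],
})

bins = [0, 12, 24, 48, 60, np.inf]
labels = ["0-12", "12-24", "24-48", "48-60", "60+"]

# Default intervals are (0, 12], (12, 24], ... so tenure 0 falls in no bin.
default_groups = pd.cut(toy["tenure"], bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 12].
closed_groups = pd.cut(toy["tenure"], bins=bins, labels=labels, include_lowest=True)

# Divide only where tenure is positive; rows with tenure 0 get 0.
positive_tenure = toy["tenure"].where(toy["tenure"] > 0)
per_month = (toy["TotalCharges"] / positive_tenure).fillna(0)

print(default_groups.tolist())
print(closed_groups.tolist())
print(per_month.tolist())
```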

### Importance for Analysis/Prediction

- **tenure_group**: By categorizing the tenure, we can perform more granular analysis on customer retention and churn rates across different tenure groups. This can help in identifying specific periods where customers are more likely to churn, allowing for targeted retention strategies.

- **TotalChargesPerMonth**: This feature provides insight into the average monthly spending of customers, which can be a significant factor in predicting churn. Customers with higher average monthly charges might have different satisfaction levels or service usage patterns compared to those with lower charges. This feature can help in segmenting customers based on their spending behavior and tailoring marketing or retention efforts accordingly.

These additional features enhance the dataset by providing more detailed and actionable insights, which are essential for effective analysis and predictive modeling.
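
The tenure-group analysis described above could be sketched like this, using made-up rows rather than the real dataset. The churn rate per group is the mean of a boolean "churned" indicator within each group.

```python
import pandas as pd

# Made-up rows; the category labels match those used for tenure_group.
toy = pd.DataFrame({
    "tenure_group": pd.Categorical(
        ["0-12", "0-12", "12-24", "60+"],
        categories=["0-12", "12-24", "24-48", "48-60", "60+"],
    ),
    "Churn": ["Yes", "No", "No", "No"],
})

# Boolean indicator: True where the customer churned.
churned = toy["Churn"].eq("Yes")

# observed=True keeps only the tenure groups that actually occur in the data.
churn_rate = churned.groupby(toy["tenure_group"], observed=True).mean()

print(churn_rate)
```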


7. **Handle Outliers**:
    - Identify outliers using statistical methods or visualization techniques.
    - Decide on strategies to handle outliers, such as capping or removing them.

In [15]:
# Cap outliers using the IQR method
def cap_outliers(df, column):
    """Clip df[column] to [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], modifying df in place."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # clip() replaces values below/above the bounds with the bounds themselves
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)

# Apply the capping function to numerical features
for col in numerical_features:
    cap_outliers(df, col)

# Verify the changes
print(df[numerical_features].describe())

             tenure  MonthlyCharges  TotalCharges  TotalChargesPerMonth
count  7.043000e+03    7.043000e+03  7.043000e+03          7.043000e+03
mean  -2.421273e-17   -6.406285e-17 -1.488074e-17          7.465592e-17
std    1.000071e+00    1.000071e+00  1.000071e+00          1.000071e+00
min   -1.318165e+00   -1.545860e+00 -9.991203e-01         -2.137475e+00
25%   -9.516817e-01   -9.725399e-01 -8.298459e-01         -9.597171e-01
50%   -1.372744e-01    1.857327e-01 -3.904632e-01          1.850696e-01
75%    9.214551e-01    8.338335e-01  6.642871e-01          8.416645e-01
max    1.613701e+00    1.794352e+00  2.826743e+00          1.873292e+00



### Handling Outliers

The outlier problem in the dataset was addressed using the Interquartile Range (IQR) method. Here's how the solution works:

1. **Identification of Outliers**:
    - For each numerical feature, the first quartile (Q1) and third quartile (Q3) were calculated.
    - The IQR was computed as the difference between Q3 and Q1.
    - Lower and upper bounds were determined using the formula:
      \[
      \text{Lower Bound} = Q1 - 1.5 \times \text{IQR}
      \]
      \[
      \text{Upper Bound} = Q3 + 1.5 \times \text{IQR}
      \]

2. **Capping Outliers**:
    - Values below the lower bound were capped at the lower bound.
    - Values above the upper bound were capped at the upper bound.

3. **Standardization**:
    - The numerical features had already been standardized with the `StandardScaler` from scikit-learn in the feature engineering step, before capping. The IQR bounds were therefore computed on values scaled to a mean of 0 and a standard deviation of 1.

### Explanation

- **Standardization**: The means are effectively zero and the standard deviations are 1.000071. That is the value expected here: `StandardScaler` divides by the population standard deviation, while `describe()` reports the sample standard deviation, which gives sqrt(7043/7042) ≈ 1.000071.
- **Capping**: Every minimum and maximum lies strictly inside its IQR bounds. For example, `TotalCharges` has a maximum of 2.83 against an upper bound of about 2.91. No values were actually capped, so these four features contain no outliers under the 1.5 × IQR rule.
- **Distribution**: Capping only moves values beyond the bounds, so the quartiles (25%, 50%, 75%) are unaffected. `TotalCharges` remains right-skewed, with a median of -0.39 and a maximum of 2.83.

Standardizing puts the numerical features on a similar scale, and the capping step guards against extreme values disproportionately influencing the analysis or predictive models should such values appear.
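
Before capping, it can help to count how many values fall outside the bounds, since that shows whether the step changes anything. A minimal sketch on made-up values; the helper name `iqr_bounds` is hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

lower, upper = iqr_bounds(values)

# Number of values that capping would change.
n_outside = int(((values < lower) | (values > upper)).sum())

# Capping replaces out-of-range values with the nearest bound.
capped = values.clip(lower=lower, upper=upper)

print(lower, upper, n_outside)
print(capped.tolist())
```

A count of zero means the capped column is identical to the original one.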


8. **Split the Data**:
    - Split the dataset into training and testing sets to evaluate the model performance.

In [None]:
from sklearn.model_selection import train_test_split

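# Re-encode after feature engineering and scaling: the df_encoded built in step 5 predates
# those steps, so it lacks tenure_group, TotalChargesPerMonth and the scaled values.
encode_columns = categorical_columns + ['SeniorCitizen', 'tenure_group']
df_encoded = pd.get_dummies(df, columns=encode_columns, drop_first=True)

# Define the features and target variable
X = df_encoded.drop(columns=['customerID', 'Churn_Yes'])
y = df_encoded['Churn_Yes']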

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


The dataset is split using a fixed `random_state` of 42 to ensure reproducibility. The `test_size` is set to 0.2, meaning 20% of the data is held out for testing and the remaining 80% is used for training. This ratio is commonly used to balance having enough data to train the model against having sufficient data to evaluate its performance. With 7,043 rows, it yields 5,634 training rows and 1,409 testing rows.

Inputs:

- `X`: Features dataset (all encoded columns except `customerID` and `Churn_Yes`).
- `y`: Target variable (`Churn_Yes`).

Outputs:

- `X_train`: Training features.
- `X_test`: Testing features.
- `y_train`: Training target variable.
- `y_test`: Testing target variable.
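
Two refinements are often worth considering for a split like this: stratifying on the target so both sets keep the same churn share, and fitting the scaler on the training set only so no information from the test set leaks into preprocessing. The sketch below uses randomly generated stand-in data, not the churn dataset, and is one possible arrangement rather than the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data: 100 rows, 25% positive class.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=100).astype(float),
    "MonthlyCharges": rng.uniform(18, 119, size=100),
})
y = pd.Series([1] * 25 + [0] * 75, name="Churn_Yes")

# stratify=y keeps the class proportions equal in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(y_train.mean(), y_test.mean())
```

With this arrangement the test set is transformed using the training means and standard deviations, so its own scaled means will generally not be exactly zero.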
