# Step 1: Data Loading

## Preprocessing Pipeline Overview

This preprocessing pipeline outlines the steps necessary to prepare the Telco Customer Churn dataset for modeling. Each step addresses a specific aspect of data quality, transformation, or feature creation, and is covered in its own Jupyter notebook.

**Step 1: Data Loading**: Loading the datasets into the workspace, ensuring all necessary files are correctly imported for analysis. This includes the Kaggle dataset and the IBM datasets.

**Step 2: Dataset Integration**: Combining relevant datasets into a single, unified dataset that will serve as the foundation for subsequent analysis.

**Step 3: Handling Missing Values**: Identifying and addressing missing values in the dataset to ensure data integrity. This step ensures no significant gaps hinder the analysis.

**Step 4: Data Type Conversion**: Converting data columns to appropriate data types to optimize memory usage and prepare for feature engineering, ensuring consistency across all columns.

**Step 5: Data Exploration**: Performing initial exploratory data analysis (EDA) to understand the dataset's structure and characteristics, and visualizing key features to gain insights into the data.

**Step 6: Feature Engineering**: Creating new features from the existing data to enhance model performance and capture additional insights. This includes transformations and derived features.

**Step 7: Outlier Detection**: Identifying and addressing outliers in the dataset to ensure they do not negatively impact the analysis or models.

**Step 8: Dataset Splitting**: Splitting the dataset into training and testing subsets to prepare for model development and evaluation. This step ensures reproducibility and robust performance metrics.

In [1]:
import pandas as pd
import os

## Data Sources

In this analysis, we utilize two primary datasets:

1. **Kaggle Dataset**: This dataset provides customer information from a telecommunications company, including demographics, account details, and service usage.

2. **IBM Dataset**: This dataset offers more comprehensive and detailed data, including additional aspects such as demographics, location, population, services, and customer status. Furthermore, an already combined version exists (`telco_customer_churn.xlsx`).

By integrating these datasets, we get more detailed customer information for our analysis in the next steps of the preprocessing pipeline.

In [2]:
# Layout of the data
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping

## 1.1 Original Kaggle Dataset

### General Information on the Kaggle Dataset

The `df_original` dataframe contains customer information from a telecommunications company. It includes 21 columns with details such as customer ID, gender, senior citizen status, partner status, dependents, tenure, phone service, multiple lines, internet service, online security, online backup, device protection, tech support, streaming TV, streaming movies, contract type, paperless billing, payment method, monthly charges, total charges, and churn status. This dataset is sourced from [Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn).

In [3]:
# Load the dataset
file_path = '../2_data/original_data/Customer-Churn.csv'
df_original = pd.read_csv(file_path)

df_original.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
# Display a concise summary of the DataFrame
df_original.info()

# Get statistical summaries of the numerical columns
df_original.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


### Analysis of the Original Kaggle Dataset

#### Key Insights:

1. **Customer Demographics**:
    - **Gender**: The dataset includes both male and female customers.
    - **Senior Citizens**: Approximately 16.2% of the customers are senior citizens.

2. **Account Information**:
    - **Partner and Dependents**: A significant portion of customers have partners, but fewer have dependents.
    - **Contract Types**: Customers have various contract types, including month-to-month, one-year, and two-year contracts.
    - **Payment Methods**: Customers use different payment methods, such as electronic check, mailed check, bank transfer, and credit card.

3. **Service Usage**:
    - **Phone and Internet Services**: The dataset includes customers with and without phone service, multiple lines, and different types of internet services (DSL, Fiber optic).
    - **Additional Services**: Customers subscribe to various additional services like online security, online backup, device protection, tech support, streaming TV, and streaming movies.

4. **Charges**:
    - **Monthly Charges**: The average monthly charge is approximately $64.76, with a wide range from $18.25 to $118.75.
    - **Total Charges**: The total charges vary significantly among customers, reflecting different service usage and tenure.

5. **Customer Tenure**:
    - **Tenure**: The tenure of customers ranges from 0 to 72 months, with a median of 29 months. This indicates a diverse range of customer loyalty.

6. **Churn**:
    - **Churn Rate**: The dataset includes information on whether customers have churned, which is crucial for predictive modeling and understanding customer retention.

### Summary Statistics:

- **SeniorCitizen**:
  - Mean: 0.162
  - Standard Deviation: 0.369
  - Min: 0
  - Max: 1

- **Tenure**:
  - Mean: 32.37 months
  - Standard Deviation: 24.56 months
  - Min: 0 months
  - Max: 72 months

- **MonthlyCharges**:
  - Mean: $64.76
  - Standard Deviation: $30.09
  - Min: $18.25
  - Max: $118.75

These insights provide a comprehensive overview of the customer demographics, account information, service usage, charges, tenure, and churn, which are essential for further analysis and modeling.
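
`TotalCharges` does not appear in the `describe()` output above, which suggests that pandas did not read it as a numeric column. A minimal sketch of one way to convert such a column follows. The toy frame and the blank-string entry are assumptions for illustration; the actual cause in `df_original` has not been checked here.

```python
import pandas as pd

# Toy frame: two values from the head() output plus one blank string.
# The blank string is an ASSUMED stand-in for whatever makes the column non-numeric.
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " "]})

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(toy["TotalCharges"].dtype)         # dtype after conversion
print(toy["TotalCharges"].isna().sum())  # number of entries that could not be parsed
```

Because `errors="coerce"` silently replaces unparseable entries with `NaN`, counting the resulting missing values right after the conversion is a useful sanity check before Step 3.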

### Conclusion on the Original Kaggle Dataset and next steps

There is a newer version of the dataset available on the IBM website, which provides more comprehensive and detailed data. We will focus on this dataset for our analysis and modeling.

## 1.2 IBM Dataset

### General Information on the IBM Dataset

For more comprehensive data, we found additional datasets in a newer version available on the corresponding [IBM website](https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113). These datasets provide detailed information on various aspects such as demographics, location, population, services, and customer status. By integrating these datasets, we can enrich our analysis with more granular and comprehensive customer information.

The `../2_data/original_data/Telecommunications_Industry` folder contains several Excel files sourced from the IBM website. Each file covers one topic relevant to the customer base. Below is a brief description of each file:

**Telco_customer_churn_demographics.xlsx**

This file contains demographic information about the customers, such as age, gender, and marital status. This data is useful for understanding the customer base and performing market segmentation.


**Telco_customer_churn_location.xlsx**

This file includes data on the geographical location of customers, including their state, city, and zip code. It helps in analyzing the distribution of customers across different regions.

**Telco_customer_churn_population.xlsx**

This file provides a population estimate for each zip code area. It helps in understanding the broader demographic context of the customer base.

**Telco_customer_churn_services.xlsx**

This file includes data on the various services subscribed to by customers, such as phone service, internet service, and additional features like online security and tech support. It helps in analyzing service usage patterns and customer preferences.

**Telco_customer_churn_status.xlsx**

This file captures the status of customers, including their satisfaction score, customer status, churn label, churn score, CLTV, and the category and reason for churn. It is essential for understanding customer retention and identifying factors contributing to churn.

Each of these files provides valuable insights into different aspects of the telecommunications industry, enabling data-driven decision-making and strategic planning.
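
The five files follow one naming pattern, so their paths can be generated instead of typed by hand. The sketch below is only an illustration: `build_paths` and `load_ibm_tables` are hypothetical helper names, and the directory names are copied from the loading cells in this notebook.

```python
from pathlib import Path

# Directory names copied from the loading cells in this notebook
RAW_DIR = Path("../2_data/original_data/Telecommunications_Industry")
CSV_DIR = Path("../2_data")
TOPICS = ["demographics", "location", "population", "services", "status"]


def build_paths(topic):
    """Return (xlsx_path, csv_path) for one IBM topic file."""
    stem = f"Telco_customer_churn_{topic}"
    return RAW_DIR / f"{stem}.xlsx", CSV_DIR / f"{stem}.csv"


def load_ibm_tables():
    """Read every topic workbook into a dict of DataFrames (hypothetical helper)."""
    import pandas as pd  # imported here so build_paths works without pandas

    return {topic: pd.read_excel(build_paths(topic)[0]) for topic in TOPICS}


for topic in TOPICS:
    xlsx_path, csv_path = build_paths(topic)
    print(xlsx_path.name, "->", csv_path.name)
```

Calling `load_ibm_tables()` would return a dictionary keyed by topic, for example `tables["status"]`. The sections below load each file in its own cell instead, which keeps every dataset next to its description.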


## 1.2.1 IBM Demographics Dataset

In [5]:
# Load the demographics dataset
demographics_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn_demographics.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_demographics.csv'
df_demographics = pd.read_excel(demographics_file_path)

# Save the demographics dataset to a CSV file
df_demographics.to_csv(csv_file_path)

# Load the demographics dataset from the CSV file
df_demographics = pd.read_csv(csv_file_path)

# Display the first few rows of the demographics dataset
df_demographics.head()

Unnamed: 0.1,Unnamed: 0,Customer ID,Count,Gender,Age,Under 30,Senior Citizen,Married,Dependents,Number of Dependents
0,0,8779-QRDMV,1,Male,78,No,Yes,No,No,0
1,1,7495-OOKFY,1,Female,74,No,Yes,Yes,Yes,1
2,2,1658-BYGOY,1,Male,71,No,Yes,No,Yes,3
3,3,4598-XLKNJ,1,Female,78,No,Yes,Yes,Yes,1
4,4,4846-WHAFZ,1,Female,80,No,Yes,Yes,Yes,1
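
The `Unnamed: 0` column in the output above is a side effect of the CSV round trip: `to_csv` writes the row index by default, and `read_csv` then reads it back as an unnamed first column. A minimal sketch of the effect, using one row from the preview and an in-memory buffer in place of a file:

```python
import io

import pandas as pd

# One row taken from the head() output above
df = pd.DataFrame({"Customer ID": ["8779-QRDMV"], "Age": [78]})

# Default behaviour: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the real columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` when saving is one way to keep the extra column out of the CSV files. The outputs in this notebook were produced without it, so the column appears in every IBM table below.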


### General Information on the IBM Demographics Dataset

The `Telco_customer_churn_demographics.xlsx` dataset, loaded as `df_demographics` for the rest of the analysis, provides detailed demographic information about the customers of a telecommunications company. The summary below lists its columns.

In [6]:
# Display a concise summary of the demographics DataFrame
df_demographics.info()

# Get statistical summaries of the numerical columns in the demographics DataFrame
df_demographics.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Unnamed: 0            7043 non-null   int64 
 1   Customer ID           7043 non-null   object
 2   Count                 7043 non-null   int64 
 3   Gender                7043 non-null   object
 4   Age                   7043 non-null   int64 
 5   Under 30              7043 non-null   object
 6   Senior Citizen        7043 non-null   object
 7   Married               7043 non-null   object
 8   Dependents            7043 non-null   object
 9   Number of Dependents  7043 non-null   int64 
dtypes: int64(4), object(6)
memory usage: 550.4+ KB


Unnamed: 0.1,Unnamed: 0,Count,Age,Number of Dependents
count,7043.0,7043.0,7043.0,7043.0
mean,3521.0,1.0,46.509726,0.468692
std,2033.283305,0.0,16.750352,0.962802
min,0.0,1.0,19.0,0.0
25%,1760.5,1.0,32.0,0.0
50%,3521.0,1.0,46.0,0.0
75%,5281.5,1.0,60.0,0.0
max,7042.0,1.0,80.0,9.0


### Key Insights from Demographics Data

Our goal is to get a general understanding of this dataset. We will analyze it in detail later.

1. **Gender Distribution**:
    - The `Gender` column records male and female customers. The split between the two groups is not visible in the summaries above and is examined later.

2. **Age Distribution**:
    - The average age of customers is approximately 46.5 years.
    - The age range spans from 19 to 80 years, indicating a diverse age group among the customers.
    - The standard deviation of 16.75 years suggests a wide variation in customer ages.

3. **Senior Citizens**:
    - The dataset includes information on whether customers are senior citizens (65 years or older). This can help in understanding the needs and preferences of older customers.

4. **Marital Status**:
    - The dataset captures whether customers are married, which can be useful for segmenting customers based on their family status.

5. **Dependents**:
    - Information on whether customers have dependents and the number of dependents they have is available.
    - The average number of dependents is approximately 0.47, with a maximum of 9 dependents.
    - At least 75% of customers have no dependents, since the 25th, 50th, and 75th percentiles are all 0.

6. **Customer Count**:
    - The `Count` column is used for reporting and dashboarding purposes, indicating the number of customers in a filtered set.

These insights provide a comprehensive overview of the customer demographics, which can be crucial for targeted marketing, customer segmentation, and personalized service offerings.

## 1.2.2 IBM Location Dataset

### General Information on the IBM Location Dataset

The `Telco_customer_churn_location.xlsx` dataset, loaded as `df_location` for the rest of the analysis, provides detailed location information about the customers of a telecommunications company. This data shows how customers are distributed geographically, which matters for regional marketing strategies, service optimization, and identifying location-specific customer needs. It can also reveal where customers are concentrated and where services or customer support could be expanded.

In [7]:
# Load the location dataset
location_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn_location.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_location.csv'
df_location = pd.read_excel(location_file_path)

# Save the location dataset to a CSV file
df_location.to_csv(csv_file_path)

# Load the location dataset from the CSV file
df_location = pd.read_csv(csv_file_path)

# Display the first few rows of the location dataset
df_location.head()

Unnamed: 0.1,Unnamed: 0,Location ID,Customer ID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude
0,0,OXCZEW7397,8779-QRDMV,1,United States,California,Los Angeles,90022,"34.02381, -118.156582",34.02381,-118.156582
1,1,FCCECI8494,7495-OOKFY,1,United States,California,Los Angeles,90063,"34.044271, -118.185237",34.044271,-118.185237
2,2,HEHUQY7254,1658-BYGOY,1,United States,California,Los Angeles,90065,"34.108833, -118.229715",34.108833,-118.229715
3,3,WIUHRF2613,4598-XLKNJ,1,United States,California,Inglewood,90303,"33.936291, -118.332639",33.936291,-118.332639
4,4,CFEZBF4415,4846-WHAFZ,1,United States,California,Whittier,90602,"33.972119, -118.020188",33.972119,-118.020188


In [8]:
# Display a concise summary of the location DataFrame
df_location.info()

# Get statistical summaries of the numerical columns in the location DataFrame
df_location.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   7043 non-null   int64  
 1   Location ID  7043 non-null   object 
 2   Customer ID  7043 non-null   object 
 3   Count        7043 non-null   int64  
 4   Country      7043 non-null   object 
 5   State        7043 non-null   object 
 6   City         7043 non-null   object 
 7   Zip Code     7043 non-null   int64  
 8   Lat Long     7043 non-null   object 
 9   Latitude     7043 non-null   float64
 10  Longitude    7043 non-null   float64
dtypes: float64(2), int64(3), object(6)
memory usage: 605.4+ KB


Unnamed: 0.1,Unnamed: 0,Count,Zip Code,Latitude,Longitude
count,7043.0,7043.0,7043.0,7043.0,7043.0
mean,3521.0,1.0,93486.071134,36.197455,-119.756684
std,2033.283305,0.0,1856.768045,2.468929,2.154425
min,0.0,1.0,90001.0,32.555828,-124.301372
25%,1760.5,1.0,92101.0,33.990646,-121.78809
50%,3521.0,1.0,93518.0,36.205465,-119.595293
75%,5281.5,1.0,95329.0,38.161321,-117.969795
max,7042.0,1.0,96150.0,41.962127,-114.192901


### Key Insights from Location Data

1. **Geographical Distribution**:
    - Zip codes range from 90001 to 96150, latitudes from about 32.56 to 41.96, and longitudes from about -124.30 to -114.19.
    - These bounds are consistent with a customer base located in California, which matches the `State` value in the preview rows.

2. **Customer Concentration**:
    - Zip codes are identifiers rather than quantities, so their mean, median, and standard deviation carry no direct meaning.
    - Customer concentration is better assessed by counting customers per zip code or per city.

3. **State and City Representation**:
    - The preview rows show several cities (Los Angeles, Inglewood, Whittier) within a single state. Whether `Country` and `State` take more than one value should be confirmed with a unique-value count before they are used as features.
    - City-level differences can still support regional analysis and reveal location-specific trends.

4. **Latitude and Longitude Range**:
    - The coordinates span roughly 9.4 degrees of latitude and 10.1 degrees of longitude, so customers are spread from the southern to the northern and from the western to the eastern edge of the covered region.
    - This range can help in analyzing regional differences in customer behavior and preferences.

5. **Data Completeness**:
    - All columns have complete data with no missing values, ensuring the reliability of the location information for further analysis.

These insights highlight the geographical diversity and distribution of the customer base, which can be crucial for regional marketing strategies, service optimization, and understanding location-specific customer needs.
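
Zip codes are labels, so it can help to store them as strings and to count customers per area instead of averaging the codes. The following sketch uses only the five preview rows of `df_location`; `toy_location` is a hypothetical name, and the same calls would apply to the full table.

```python
import pandas as pd

# Rows copied from the head() output of df_location
toy_location = pd.DataFrame({
    "State": ["California"] * 5,
    "City": ["Los Angeles", "Los Angeles", "Los Angeles", "Inglewood", "Whittier"],
    "Zip Code": [90022, 90063, 90065, 90303, 90602],
})

# Store zip codes as five-character strings rather than integers
toy_location["Zip Code"] = toy_location["Zip Code"].astype(str).str.zfill(5)

# Count distinct states and customers per city
n_states = toy_location["State"].nunique()
customers_per_city = toy_location["City"].value_counts()

print(n_states)
print(customers_per_city)
```

Running `df_location["State"].nunique()` on the full table would show whether more than one state is actually present.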

## 1.2.3 IBM Population Dataset

### General Information on the IBM Population Dataset

The `Telco_customer_churn_population.xlsx` dataset, loaded as `df_population` for the rest of the analysis, provides a population estimate for each zip code area. This dataset helps in understanding the broader demographic context of the customer base, which can be useful for regional analysis and for identifying trends based on the population size of an area.

In [9]:
# Load the population dataset
population_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn_population.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_population.csv'
df_population = pd.read_excel(population_file_path)

# Save the population dataset to a CSV file
df_population.to_csv(csv_file_path)

# Load the population dataset from the CSV file
df_population = pd.read_csv(csv_file_path)

# Display the first few rows of the population dataset
df_population.head()

Unnamed: 0.1,Unnamed: 0,ID,Zip Code,Population
0,0,1,90001,54492
1,1,2,90002,44586
2,2,3,90003,58198
3,3,4,90004,67852
4,4,5,90005,43019


In [10]:
# Display a concise summary of the population DataFrame
df_population.info()

# Get statistical summaries of the numerical columns in the population DataFrame
df_population.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1671 entries, 0 to 1670
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1671 non-null   int64
 1   ID          1671 non-null   int64
 2   Zip Code    1671 non-null   int64
 3   Population  1671 non-null   int64
dtypes: int64(4)
memory usage: 52.3 KB


Unnamed: 0.1,Unnamed: 0,ID,Zip Code,Population
count,1671.0,1671.0,1671.0,1671.0
mean,835.0,836.0,93678.99222,20276.384201
std,482.520466,482.520466,1817.763591,20689.1173
min,0.0,1.0,90001.0,11.0
25%,417.5,418.5,92269.0,1789.0
50%,835.0,836.0,93664.0,14239.0
75%,1252.5,1253.5,95408.0,32942.5
max,1670.0,1671.0,96161.0,105285.0


### Key Insights from Population Data

1. **Population Distribution**:
    - The population estimates for the zip code areas vary widely. The mean is approximately 20,276, while the median is 14,239, which points to a right-skewed distribution.
    - The standard deviation of 20,689 is larger than the mean, confirming that population sizes differ strongly between areas. The table contains no area measurements, so it describes population size rather than density.

2. **Population Range**:
    - The minimum population estimate is 11, indicating very sparsely populated areas.
    - The maximum population estimate is 105,285, showing that some zip code areas are highly populated.
    - The middle 50% of zip code areas have populations between 1,789 and 32,942.5 (the 25th and 75th percentiles).

3. **Zip Code Coverage**:
    - The dataset covers 1,671 zip codes, from 90001 to 96161.
    - Zip codes are identifiers, so their mean and median carry no direct meaning and should not be read as evidence of an even distribution.

4. **Data Completeness**:
    - All columns have complete data with no missing values, ensuring the reliability of the population information for further analysis.

These insights provide a comprehensive overview of the population distribution across different zip code areas, which can be crucial for regional analysis, market segmentation, and understanding customer demographics.
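
The population table shares the `Zip Code` column with the location table, so a population figure can be attached to each customer through a join. The sketch below is a hedged illustration using only preview rows from both tables; the toy frame names are hypothetical, and the actual integration is the subject of Step 2.

```python
import pandas as pd

# Rows copied from the head() outputs of df_location and df_population
toy_location = pd.DataFrame({
    "Customer ID": ["8779-QRDMV", "7495-OOKFY", "1658-BYGOY"],
    "Zip Code": [90022, 90063, 90065],
})
toy_population = pd.DataFrame({
    "Zip Code": [90001, 90002, 90003],
    "Population": [54492, 44586, 58198],
})

merged = toy_location.merge(
    toy_population,
    on="Zip Code",
    how="left",               # keep every customer, matched or not
    validate="many_to_one",   # raise if a zip code repeats in the population table
    indicator=True,           # add a _merge column that flags unmatched rows
)

print(merged["_merge"].value_counts())
```

None of the preview rows share a zip code, so every row is flagged `left_only` here and `Population` is missing. On the full tables, the same `_merge` counts would show how many customers actually receive a population figure.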

## 1.2.4 IBM Services Dataset

### General Information on the IBM Services Dataset

The `Telco_customer_churn_services.xlsx` dataset, loaded as `df_services` for the rest of the analysis, provides detailed information about the services subscribed to by customers of a telecommunications company. This dataset includes features such as phone service, internet service, online security, online backup, device protection, tech support, streaming TV, streaming movies, and more. Analyzing this data helps in understanding service usage patterns and customer preferences, and in identifying factors that may influence customer churn.

In [11]:
# Load the services dataset
services_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn_services.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_services.csv'
df_services = pd.read_excel(services_file_path)

# Save the services dataset to a CSV file
df_services.to_csv(csv_file_path)

# Load the services dataset from the CSV file
df_services = pd.read_csv(csv_file_path)

# Display the first few rows of the services dataset
df_services.head()

Unnamed: 0.1,Unnamed: 0,Service ID,Customer ID,Count,Quarter,Referred a Friend,Number of Referrals,Tenure in Months,Offer,Phone Service,Avg Monthly Long Distance Charges,Multiple Lines,Internet Service,Internet Type,Avg Monthly GB Download,Online Security,Online Backup,Device Protection Plan,Premium Tech Support,Streaming TV,Streaming Movies,Streaming Music,Unlimited Data,Contract,Paperless Billing,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue
0,0,IJKDQVSWH3522,8779-QRDMV,1,Q3,No,0,1,,No,0.0,No,Yes,DSL,8,No,No,Yes,No,No,Yes,No,No,Month-to-Month,Yes,Bank Withdrawal,39.65,39.65,0.0,20,0.0,59.65
1,1,BFKMZJAIE2285,7495-OOKFY,1,Q3,Yes,1,8,Offer E,Yes,48.85,Yes,Yes,Fiber Optic,17,No,Yes,No,No,No,No,No,Yes,Month-to-Month,Yes,Credit Card,80.65,633.3,0.0,0,390.8,1024.1
2,2,EIMVJQBMT7187,1658-BYGOY,1,Q3,No,0,18,Offer D,Yes,11.33,Yes,Yes,Fiber Optic,52,No,No,No,No,Yes,Yes,Yes,Yes,Month-to-Month,Yes,Bank Withdrawal,95.45,1752.55,45.61,0,203.94,1910.88
3,3,EROZQXDUU4979,4598-XLKNJ,1,Q3,Yes,1,25,Offer C,Yes,19.76,No,Yes,Fiber Optic,12,No,Yes,Yes,No,Yes,Yes,No,Yes,Month-to-Month,Yes,Bank Withdrawal,98.5,2514.5,13.43,0,494.0,2995.07
4,4,GEEYSJUHY6991,4846-WHAFZ,1,Q3,Yes,1,37,Offer C,Yes,6.33,Yes,Yes,Fiber Optic,14,No,No,No,No,No,No,No,Yes,Month-to-Month,Yes,Bank Withdrawal,76.5,2868.15,0.0,0,234.21,3102.36


In [12]:
# Display a concise summary of the services DataFrame
df_services.info()

# Get statistical summaries of the numerical columns in the services DataFrame
df_services.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 32 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         7043 non-null   int64  
 1   Service ID                         7043 non-null   object 
 2   Customer ID                        7043 non-null   object 
 3   Count                              7043 non-null   int64  
 4   Quarter                            7043 non-null   object 
 5   Referred a Friend                  7043 non-null   object 
 6   Number of Referrals                7043 non-null   int64  
 7   Tenure in Months                   7043 non-null   int64  
 8   Offer                              3166 non-null   object 
 9   Phone Service                      7043 non-null   object 
 10  Avg Monthly Long Distance Charges  7043 non-null   float64
 11  Multiple Lines                     7043 non-null   objec

Unnamed: 0.1,Unnamed: 0,Count,Number of Referrals,Tenure in Months,Avg Monthly Long Distance Charges,Avg Monthly GB Download,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,3521.0,1.0,1.951867,32.386767,22.958954,20.515405,64.761692,2280.381264,1.962182,6.860713,749.099262,3034.379056
std,2033.283305,0.0,3.001199,24.542061,15.448113,20.41894,30.090047,2266.220462,7.902614,25.104978,846.660055,2865.204542
min,0.0,1.0,0.0,1.0,0.0,0.0,18.25,18.8,0.0,0.0,0.0,21.36
25%,1760.5,1.0,0.0,9.0,9.21,3.0,35.5,400.15,0.0,0.0,70.545,605.61
50%,3521.0,1.0,0.0,29.0,22.89,17.0,70.35,1394.55,0.0,0.0,401.44,2108.64
75%,5281.5,1.0,3.0,55.0,36.395,27.0,89.85,3786.6,0.0,0.0,1191.1,4801.145
max,7042.0,1.0,11.0,72.0,49.99,85.0,118.75,8684.8,49.79,150.0,3564.72,11979.34


### Key Insights from Services Data

1. **Customer Tenure**:
    - The average tenure of customers is approximately 32.4 months, with a standard deviation of 24.54 months. This indicates a wide range of customer loyalty, from new customers to those who have been with the company for several years.
    - `Tenure in Months` ranges from 1 to 72 months in this table, whereas the Kaggle `tenure` column starts at 0. The dataset therefore includes both new and long-term customers.

2. **Monthly Charges**:
    - The average monthly charge is around $64.76, with a standard deviation of $30.09. This suggests significant variability in the services and packages subscribed to by customers.
    - Monthly charges range from $18.25 to $118.75, indicating a diverse customer base with different service levels and usage patterns.

3. **Service Subscriptions**:
    - A high proportion of customers subscribe to phone services, with many also opting for multiple lines.
    - Internet service subscriptions are common, with options including DSL, Fiber Optic, and Cable. The presence of multiple internet service types suggests varied customer preferences and availability of services.
    - Additional services such as online security, online backup, device protection, and premium tech support are also subscribed to by many customers, indicating a demand for comprehensive service packages.

4. **Streaming Services**:
    - Streaming TV, movies, and music services are popular among customers, reflecting the growing trend of using internet services for entertainment purposes.
    - The availability of unlimited data plans suggests that some customers are willing to pay extra for unrestricted data usage, likely due to high streaming and download needs.

5. **Contract and Billing Preferences**:
    - Customers have different contract types, including month-to-month, one-year, and two-year contracts. This variety allows for flexibility in customer commitments.
    - Paperless billing is a common choice, indicating a preference for digital communication and billing methods.
    - Payment methods vary, with options including bank withdrawal, credit card, and mailed check, catering to different customer preferences.

6. **Churn Analysis**:
    - The services table has no churn column of its own. Churn labels are stored in the status dataset and can be linked to the services through `Customer ID`.
    - Once the tables are linked, analyzing the relationship between service subscriptions, tenure, monthly charges, and churn can provide valuable insights for developing strategies to reduce churn and improve customer satisfaction.

These insights highlight the diverse customer base, varied service subscriptions, and different billing preferences, which are essential for tailoring marketing strategies, improving customer retention, and enhancing overall service offerings.
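
The column listing of `df_services` contains no churn column, while the status table loaded in section 1.2.5 carries `Churn Value`. Relating services to churn therefore needs a join on `Customer ID`. A minimal sketch with preview rows from both tables (the toy frame names are hypothetical):

```python
import pandas as pd

# Rows copied from the head() outputs of df_services and df_status
toy_services = pd.DataFrame({
    "Customer ID": ["8779-QRDMV", "7495-OOKFY", "1658-BYGOY"],
    "Tenure in Months": [1, 8, 18],
    "Monthly Charge": [39.65, 80.65, 95.45],
})
toy_status = pd.DataFrame({
    "Customer ID": ["8779-QRDMV", "7495-OOKFY", "1658-BYGOY"],
    "Churn Value": [1, 1, 1],
})

# validate="one_to_one" raises if a customer appears twice in either table
combined = toy_services.merge(
    toy_status, on="Customer ID", how="inner", validate="one_to_one"
)

print(combined)
```

The `validate` argument is a cheap guard: both tables report 7043 rows, so a one-to-one relationship is expected, but the check makes that assumption explicit.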

## 1.2.5 IBM Status Dataset

### General Information on the IBM Status Dataset

The `Telco_customer_churn_status.xlsx` dataset, loaded as `df_status` for the rest of the analysis, provides detailed information about the status of customers of a telecommunications company. This dataset includes features such as satisfaction score, customer status, churn label, churn value, churn score, customer lifetime value (CLTV), churn category, and churn reason. Analyzing this data helps in understanding customer retention and in identifying the reasons customers give for leaving.

In [13]:
# Load the status dataset
status_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn_status.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_status.csv'
df_status = pd.read_excel(status_file_path)

# Save the status dataset to a CSV file
df_status.to_csv(csv_file_path)

# Load the status dataset from the CSV file
df_status = pd.read_csv(csv_file_path)

# Display the first few rows of the status dataset
df_status.head()

Unnamed: 0.1,Unnamed: 0,Status ID,Customer ID,Count,Quarter,Satisfaction Score,Customer Status,Churn Label,Churn Value,Churn Score,CLTV,Churn Category,Churn Reason
0,0,SWSORB1252,8779-QRDMV,1,Q3,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data
1,1,SNAEQA8572,7495-OOKFY,1,Q3,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer
2,2,LMBQNN3714,1658-BYGOY,1,Q3,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer
3,3,VRZYZI9978,4598-XLKNJ,1,Q3,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services
4,4,FDNAKX1688,4846-WHAFZ,1,Q3,2,Churned,Yes,1,67,2793,Price,Extra data charges


In [14]:
# Display a concise summary of the status DataFrame
df_status.info()

# Get statistical summaries of the numerical columns in the status DataFrame
df_status.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          7043 non-null   int64 
 1   Status ID           7043 non-null   object
 2   Customer ID         7043 non-null   object
 3   Count               7043 non-null   int64 
 4   Quarter             7043 non-null   object
 5   Satisfaction Score  7043 non-null   int64 
 6   Customer Status     7043 non-null   object
 7   Churn Label         7043 non-null   object
 8   Churn Value         7043 non-null   int64 
 9   Churn Score         7043 non-null   int64 
 10  CLTV                7043 non-null   int64 
 11  Churn Category      1869 non-null   object
 12  Churn Reason        1869 non-null   object
dtypes: int64(6), object(7)
memory usage: 715.4+ KB


Unnamed: 0.1,Unnamed: 0,Count,Satisfaction Score,Churn Value,Churn Score,CLTV
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,3521.0,1.0,3.244924,0.26537,58.50504,4400.295755
std,2033.283305,0.0,1.201657,0.441561,21.170031,1183.057152
min,0.0,1.0,1.0,0.0,5.0,2003.0
25%,1760.5,1.0,3.0,0.0,40.0,3469.0
50%,3521.0,1.0,3.0,0.0,61.0,4527.0
75%,5281.5,1.0,4.0,1.0,75.5,5380.5
max,7042.0,1.0,5.0,1.0,96.0,6500.0
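
The `Unnamed: 0` column in the summary above is not part of the original data. It is the DataFrame index that `to_csv` writes by default and that `read_csv` then reads back as an ordinary column. The following is a minimal sketch of that behaviour on a toy DataFrame, using an in-memory buffer instead of the project's files; the toy data and variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy stand-in for the status data (IDs copied from the preview above)
df = pd.DataFrame({
    "Customer ID": ["8779-QRDMV", "7495-OOKFY"],
    "Churn Value": [1, 1],
})

# Default behaviour: the index is written as an unnamed first column
buf_default = io.StringIO()
df.to_csv(buf_default)
buf_default.seek(0)
with_index = pd.read_csv(buf_default)
print(list(with_index.columns))  # ['Unnamed: 0', 'Customer ID', 'Churn Value']

# index=False keeps the index out of the file
buf_clean = io.StringIO()
df.to_csv(buf_clean, index=False)
buf_clean.seek(0)
without_index = pd.read_csv(buf_clean)
print(list(without_index.columns))  # ['Customer ID', 'Churn Value']
```

If the extra column is not wanted, passing `index=False` to `to_csv` (or dropping `Unnamed: 0` after loading) would remove it; the outputs recorded in this notebook were produced without that option.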



### Key Insights from Status Data

1. **Customer Satisfaction**:
    - The average satisfaction score is approximately 3.24, indicating a moderate level of satisfaction among customers.
    - The satisfaction scores range from 1 (Very Unsatisfied) to 5 (Very Satisfied), with a standard deviation of 1.20, suggesting varied customer experiences.

2. **Customer Churn**:
    - About 26.54% of the customers have churned, as indicated by the mean churn value of 0.265.
    - The churn score, which predicts the likelihood of a customer leaving, has an average value of 58.51, with scores ranging from 5 to 96. The median is 61 and the upper quartile is 75.5, so roughly a quarter of customers score above 75.

3. **Customer Lifetime Value (CLTV)**:
    - The average CLTV is 4400.30, with values ranging from 2003 to 6500. This suggests a wide range of customer values, with some customers being significantly more valuable than others.
    - The standard deviation of CLTV is 1183.06, indicating substantial variability in customer value.

4. **Churn Reasons and Categories**:
    - Only a subset of the data (1869 entries) includes specific churn reasons and categories, which can provide insights into why customers are leaving.
    - Common churn categories include Attitude, Competitor, Dissatisfaction, Other, and Price, which can help in identifying areas for improvement.

5. **Data Completeness**:
    - The dataset is complete for most columns, with the exception of 'Churn Category' and 'Churn Reason', which have non-null counts of 1869, indicating that churn reasons are only recorded for a portion of the customers.

These insights can help in understanding customer behavior, identifying at-risk customers, and improving customer retention strategies.
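
The churn figures above can be cross-checked with a few lines of pandas. The sketch below uses a small toy DataFrame (the values are made up, only the column names follow `df_status`) to show that the mean of a 0/1 column is the churn rate, and how one might test whether `Churn Category` is missing exactly for the customers who did not churn. Whether that holds for the real file is an assumption to verify, although the reported counts (1869 non-null entries out of 7043 rows) are consistent with it.

```python
import pandas as pd

# Toy stand-in for df_status: three retained customers, one churned
toy_status = pd.DataFrame({
    "Churn Value": [0, 0, 1, 0],
    "Churn Category": [None, None, "Competitor", None],
})

# The mean of a 0/1 column is the share of ones
churn_rate = toy_status["Churn Value"].mean()
print(f"{churn_rate:.2%}")  # 25.00%

# Is the category missing exactly for the customers who did not churn?
missing_category = toy_status["Churn Category"].isna()
churned = toy_status["Churn Value"] == 1
consistent = bool((missing_category == ~churned).all())
print(consistent)  # True

# Counts reported by info() above: 1869 non-null churn reasons, 7043 rows
print(round(1869 / 7043, 5))  # 0.26537
```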

## 1.2.6 IBM CustomerChurn Dataset

### General Information on the IBM Customer Churn Dataset

The `CustomerChurn.xlsx` dataset, referred to as `df_customerc` in the rest of this analysis, provides detailed information about the customers of a telecommunications company. This dataset includes features such as customer demographics, account information, services subscribed, and churn status. Analyzing this data helps in understanding customer retention, identifying factors that may influence customer churn, and evaluating the effectiveness of different service offerings and contract types.

In [15]:
# Load the customer churn dataset
customerc_file_path = '../2_data/original_data/Telecommunications_Industry/CustomerChurn.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_customerc.csv'
df_customerc = pd.read_excel(customerc_file_path)

# Save the customer churn dataset to a CSV file
df_customerc.to_csv(csv_file_path)

# Load the customer churn dataset from the CSV file
df_customerc = pd.read_csv(csv_file_path)

# Display the first few rows of the customer churn dataset
df_customerc.head()

Unnamed: 0.1,Unnamed: 0,LoyaltyID,Customer ID,Senior Citizen,Partner,Dependents,Tenure,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn
0,0,318537,7590-VHVEG,No,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,1,152148,5575-GNVDE,No,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,2,326527,3668-QPYBK,No,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,3,845894,7795-CFOCW,No,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,4,503388,9237-HQITU,No,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [16]:
# Display a concise summary of the customer churn DataFrame
df_customerc.info()

# Get statistical summaries of the numerical columns in the customer churn DataFrame
df_customerc.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         7043 non-null   int64  
 1   LoyaltyID          7043 non-null   int64  
 2   Customer ID        7043 non-null   object 
 3   Senior Citizen     7043 non-null   object 
 4   Partner            7043 non-null   object 
 5   Dependents         7043 non-null   object 
 6   Tenure             7043 non-null   int64  
 7   Phone Service      7043 non-null   object 
 8   Multiple Lines     7043 non-null   object 
 9   Internet Service   7043 non-null   object 
 10  Online Security    7043 non-null   object 
 11  Online Backup      7043 non-null   object 
 12  Device Protection  7043 non-null   object 
 13  Tech Support       7043 non-null   object 
 14  Streaming TV       7043 non-null   object 
 15  Streaming Movies   7043 non-null   object 
 16  Contract           7043 

Unnamed: 0.1,Unnamed: 0,LoyaltyID,Tenure,Monthly Charges
count,7043.0,7043.0,7043.0,7043.0
mean,3521.0,550382.651001,32.371149,64.761692
std,2033.283305,260776.11869,24.559481,30.090047
min,0.0,100346.0,0.0,18.25
25%,1760.5,323604.5,9.0,35.5
50%,3521.0,548704.0,29.0,70.35
75%,5281.5,776869.0,55.0,89.85
max,7042.0,999912.0,72.0,118.75


### Key Insights from Customer Churn Data

1. **Customer Loyalty**:
    - The `LoyaltyID` column is an identifier, not a measure of loyalty. Its mean and range in the summary above carry no analytical meaning, so it should be treated as a key rather than as a numerical feature.

2. **Tenure**:
    - The `Tenure` column shows the number of months a customer has been with the company. The average tenure is approximately 32 months, with a standard deviation of 24.56 months. This indicates a diverse range of customer loyalty, from new customers to those who have been with the company for several years.
    - The tenure ranges from 0 to 72 months, showing that the dataset includes both new and long-term customers.

3. **Monthly Charges**:
    - The `Monthly Charges` column represents the monthly billing amount for each customer. The average monthly charge is around $64.76, with a standard deviation of $30.09. This suggests significant variability in the services and packages subscribed to by customers.
    - Monthly charges range from $18.25 to $118.75, indicating a diverse customer base with different service levels and usage patterns.

4. **Service Subscriptions**:
    - The dataset includes various service-related columns such as `Phone Service`, `Multiple Lines`, `Internet Service`, `Online Security`, `Online Backup`, `Device Protection`, `Tech Support`, `Streaming TV`, and `Streaming Movies`. These columns indicate whether customers subscribe to these services, providing insights into service usage patterns and customer preferences.

5. **Contract Types**:
    - The `Contract` column shows the type of contract customers have, including month-to-month, one-year, and two-year contracts. This variety allows for flexibility in customer commitments and can be analyzed to understand the impact of contract types on churn.

6. **Billing and Payment Methods**:
    - The `Paperless Billing` column indicates whether customers use paperless billing, and the `Payment Method` column shows the payment methods used by customers, such as electronic check, mailed check, bank transfer, and credit card. These columns provide insights into customer preferences for billing and payment, which can help tailor customer service and billing processes.

7. **Churn**:
    - The `Churn` column indicates whether customers have churned. This is the target variable for predictive modeling and is crucial for understanding customer retention. Analyzing the relationship between churn and other features can provide valuable insights for developing strategies to reduce churn and improve customer satisfaction.

These insights highlight the diverse customer base, varied service subscriptions, and different billing preferences, which are essential for tailoring marketing strategies, improving customer retention, and enhancing overall service offerings.
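
Two details of the summary above are worth checking in code. First, `LoyaltyID` is stored as an integer but acts as an identifier, so the relevant check is uniqueness rather than descriptive statistics. Second, `Total Charges` appears in the preview with numeric-looking values but is absent from the `describe()` output, which suggests it was not read as a numeric column. The sketch below shows one way to check both on a toy DataFrame; the blank `Total Charges` entry is hypothetical and stands in for whatever prevents numeric parsing in the real file.

```python
import pandas as pd

# Toy stand-in for df_customerc; the blank Total Charges entry is hypothetical
toy_customers = pd.DataFrame({
    "LoyaltyID": [318537, 152148, 326527],
    "Tenure": [1, 34, 0],
    "Total Charges": ["29.85", "1889.5", " "],
})

# An identifier should be unique; its mean or range carries no meaning
loyalty_is_unique = toy_customers["LoyaltyID"].is_unique
print(loyalty_is_unique)  # True

# Coerce text to numbers; entries that cannot be parsed become NaN
total_charges = pd.to_numeric(toy_customers["Total Charges"], errors="coerce")
n_unparsed = int(total_charges.isna().sum())
print(n_unparsed)  # 1
```

Rows flagged this way can then be handled together with the other missing values in Step 3.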

## 1.2.7 IBM Telco_Customer_churn Dataset

### General Information on the IBM Telco Customer Churn Dataset

The `Telco_customer_churn.xlsx` dataset, referred to as `df_telcocustomerc` in the rest of this analysis, provides detailed information about the customers of a telecommunications company. In addition to customer demographics, account information, subscribed services, and churn status, it contains geographic location, churn score, CLTV, and churn reason. Analyzing this data helps in understanding customer retention, identifying factors that may influence customer churn, and evaluating the effectiveness of different service offerings and contract types.

In [17]:
# Load the Telco customer churn dataset
telcocustomerc_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_telcocustomerc.csv'
df_telcocustomerc = pd.read_excel(telcocustomerc_file_path)

# Save the Telco customer churn dataset to a CSV file
df_telcocustomerc.to_csv(csv_file_path)

# Load the Telco customer churn dataset from the CSV file
df_telcocustomerc = pd.read_csv(csv_file_path)

# Display the first few rows of the Telco customer churn dataset
df_telcocustomerc.head()

Unnamed: 0.1,Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,No,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,No,No,Yes,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,No,No,Yes,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,No,Yes,Yes,28,Yes,Yes,Fiber optic,No,No,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,No,No,Yes,49,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


In [18]:
# Display a concise summary of the Telco customer churn DataFrame
df_telcocustomerc.info()

# Get statistical summaries of the numerical columns in the Telco customer churn DataFrame
df_telcocustomerc.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 34 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         7043 non-null   int64  
 1   CustomerID         7043 non-null   object 
 2   Count              7043 non-null   int64  
 3   Country            7043 non-null   object 
 4   State              7043 non-null   object 
 5   City               7043 non-null   object 
 6   Zip Code           7043 non-null   int64  
 7   Lat Long           7043 non-null   object 
 8   Latitude           7043 non-null   float64
 9   Longitude          7043 non-null   float64
 10  Gender             7043 non-null   object 
 11  Senior Citizen     7043 non-null   object 
 12  Partner            7043 non-null   object 
 13  Dependents         7043 non-null   object 
 14  Tenure Months      7043 non-null   int64  
 15  Phone Service      7043 non-null   object 
 16  Multiple Lines     7043 

Unnamed: 0.1,Unnamed: 0,Count,Zip Code,Latitude,Longitude,Tenure Months,Monthly Charges,Churn Value,Churn Score,CLTV
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,3521.0,1.0,93521.964646,36.282441,-119.79888,32.371149,64.761692,0.26537,58.699418,4400.295755
std,2033.283305,0.0,1865.794555,2.455723,2.157889,24.559481,30.090047,0.441561,21.525131,1183.057152
min,0.0,1.0,90001.0,32.555828,-124.301372,0.0,18.25,0.0,5.0,2003.0
25%,1760.5,1.0,92102.0,34.030915,-121.815412,9.0,35.5,0.0,40.0,3469.0
50%,3521.0,1.0,93552.0,36.391777,-119.730885,29.0,70.35,0.0,61.0,4527.0
75%,5281.5,1.0,95351.0,38.224869,-118.043237,55.0,89.85,1.0,75.0,5380.5
max,7042.0,1.0,96161.0,41.962127,-114.192901,72.0,118.75,1.0,100.0,6500.0
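
Not every column in the table above is a genuine measurement: `Count` is constant (minimum, maximum, and mean are all 1), and `Zip Code` is a numeric code whose mean has no interpretation. A minimal sketch, on a toy DataFrame with values copied from the preview, of how constant columns could be detected and how the summary could be restricted to measurement columns; the choice of `measure_cols` is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for df_telcocustomerc (values copied from the preview above)
toy_telco = pd.DataFrame({
    "Count": [1, 1, 1],
    "Zip Code": [90003, 90005, 90006],
    "Tenure Months": [2, 2, 8],
    "Monthly Charges": [53.85, 70.7, 99.65],
})

# Columns with a single distinct value carry no information
constant_cols = [c for c in toy_telco.columns if toy_telco[c].nunique() == 1]
print(constant_cols)  # ['Count']

# Summarise only the columns that are genuine measurements
measure_cols = ["Tenure Months", "Monthly Charges"]
summary = toy_telco[measure_cols].describe()
print(summary)
```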


### Key Insights from Telco Customer Churn Data

1. **Customer Demographics**:
  - The dataset includes demographic information such as `Gender`, `Senior Citizen`, `Partner`, and `Dependents`. This allows for analysis of how different demographic groups are distributed and how they might influence churn.

2. **Geographical Information**:
  - Columns like `Country`, `State`, `City`, `Zip Code`, `Latitude`, and `Longitude` provide detailed geographical data. This can be used to analyze regional trends and identify areas with higher churn rates.

3. **Service Details**:
  - The dataset contains information on various services subscribed by customers, including `Phone Service`, `Multiple Lines`, `Internet Service`, `Online Security`, `Online Backup`, `Device Protection`, `Tech Support`, `Streaming TV`, and `Streaming Movies`. This helps in understanding customer preferences and service usage patterns.

4. **Contract and Billing**:
  - Columns such as `Contract`, `Paperless Billing`, and `Payment Method` provide insights into the types of contracts customers have and their billing preferences. This can be crucial for analyzing the impact of contract types and billing methods on churn.

5. **Financial Information**:
  - `Monthly Charges` and `Total Charges` give an overview of the financial aspect of customer subscriptions. The wide range of charges indicates diverse service levels and customer spending patterns.

6. **Churn Information**:
  - The dataset includes several columns related to churn, such as `Churn Label`, `Churn Value`, `Churn Score`, `CLTV`, and `Churn Reason`. These columns are essential for understanding customer retention and identifying factors contributing to churn.

7. **Tenure**:
  - The `Tenure Months` column shows the duration of time customers have been with the company. The average tenure is approximately 32 months, with a wide range from 0 to 72 months, indicating a diverse customer base in terms of loyalty.

8. **Customer Lifetime Value (CLTV)**:
  - The `CLTV` column represents the customer lifetime value, which is a crucial metric for understanding the long-term value of customers. The average CLTV is around 4400, with a standard deviation of about 1183 and values ranging from 2003 to 6500.

9. **Churn Score and Reasons**:
  - The `Churn Score` provides a numerical representation of the likelihood of a customer churning, with values ranging from 5 to 100. The `Churn Reason` column, although sparsely populated, offers qualitative insights into why customers are leaving.

10. **Data Completeness**:
   - While most columns have complete data, some columns like `Churn Reason` have a significant number of missing values. This indicates that churn reasons are only recorded for a subset of customers, which might limit the analysis of churn causes.

The Telco Customer Churn dataset provides a comprehensive view of customer demographics, geographical information, service details, financial information, and churn-related metrics. The diversity in customer profiles, service usage, and financial aspects allows for in-depth analysis of factors influencing customer churn. Understanding these insights can help in developing targeted strategies to improve customer retention and satisfaction.
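
As a first step toward analyzing the impact of contract type on churn, the churn rate can be computed per group. Because `Churn Value` is a 0/1 flag, its group mean is the churn rate of that group. The sketch below uses made-up values, and the contract labels are assumed to match the dataset; only the column names are taken from the tables above.

```python
import pandas as pd

# Toy data: column names follow the dataset, the values are made up
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn Value": [1, 0, 0, 0],
})

# Mean of the 0/1 flag per group = churn rate per contract type
churn_by_contract = toy.groupby("Contract")["Churn Value"].mean()
print(churn_by_contract)
```

The same pattern applies to `Payment Method`, `Paperless Billing`, or any other categorical column.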