# Step 5: Data Exploration

## Preprocessing Pipeline Overview

This preprocessing pipeline prepares the Telco Customer Churn dataset for modeling. Each step addresses a specific aspect of data quality, transformation, or feature creation, and each is covered in a separate Jupyter notebook.

**Step 1: Data Loading**: Loading the datasets into the workspace, ensuring all necessary files are correctly imported for analysis. This includes the Kaggle dataset and the IBM datasets.

**Step 2: Dataset Integration**: Combining relevant datasets into a single, unified dataset that will serve as the foundation for subsequent analysis.

**Step 3: Handling Missing Values**: Identifying and addressing missing values in the dataset to ensure data integrity. This step ensures no significant gaps hinder the analysis.

**Step 4: Data Type Conversion**: Converting data columns to appropriate data types to optimize memory usage and prepare for feature engineering. This step also ensures that types are consistent across all columns.

**Step 5: Data Exploration**: Performing initial exploratory data analysis (EDA) to understand the dataset's structure and characteristics, and visualizing key features to gain insights into the data.

**Step 6: Feature Engineering**: Creating new features from the existing data to enhance model performance and capture additional insights. This includes transformations and derived features.

**Step 7: Outlier Detection**: Identifying and addressing outliers in the dataset to ensure they do not negatively impact the analysis or models.

**Step 8: Dataset Splitting**: Splitting the dataset into training and testing subsets to prepare for model development and evaluation. A fixed split keeps results reproducible and lets performance be measured on data the model has not seen.

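In [None]:
import pandas as pd

# Load the combined dataset
df = pd.read_csv('../2_data/telcocustomerchurn_combined.csv')

# Summary statistics for the numerical columns
print("Dataset Description:")
print(df.describe())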

Dataset Description:
        Unnamed: 0   Count          Age  Number of Dependents      Zip Code  \
count  7043.000000  7043.0  7043.000000           7043.000000   7043.000000   
mean   3521.000000     1.0    46.509726              0.468692  93486.071134   
std    2033.283305     0.0    16.750352              0.962802   1856.768045   
min       0.000000     1.0    19.000000              0.000000  90001.000000   
25%    1760.500000     1.0    32.000000              0.000000  92101.000000   
50%    3521.000000     1.0    46.000000              0.000000  93518.000000   
75%    5281.500000     1.0    60.000000              0.000000  95329.000000   
max    7042.000000     1.0    80.000000              9.000000  96150.000000   

          Latitude    Longitude  Number of Referrals  Tenure in Months  \
count  7043.000000  7043.000000          7043.000000       7043.000000   
mean     36.197455  -119.756684             1.951867         32.386767   
std       2.468929     2.154425             3

### Key Findings from the Telco Customer Churn Dataset Description

1. **General Statistics**:
    - The dataset contains 7043 entries.
    - The average age of customers is approximately 46.5 years, with a standard deviation of 16.75 years.
    - The age range of customers is between 19 and 80 years.

2. **Dependents**:
    - The average number of dependents per customer is 0.47, but the median and the 75th percentile are both 0, so at least three quarters of customers have no dependents.
    - The maximum number of dependents for any customer is 9.

3. **Satisfaction and Churn**:
    - The average satisfaction score is 3.24 out of 5.
    - The churn value indicates that approximately 26.5% of customers have churned.
    - The churn score ranges from 5 to 96, with an average of 58.5.

4. **Customer Lifetime Value (CLTV)**:
    - The average CLTV is 4400.30, with a standard deviation of 1183.06.
    - The CLTV ranges from 2003 to 6500.

5. **Tenure and Charges**:
    - The average tenure of customers is approximately 32.39 months.
    - Monthly charges range from 18.25 to 118.75, with an average of 64.76.

These statistics summarize only the numerical columns. Categorical columns such as `Gender` and `Married` do not appear in this description and need to be examined separately.
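
The dependents finding is easier to judge from the share of zeros and the median than from the mean alone. The following is a minimal sketch of that check. It uses a small made-up frame in place of `df`: only the column name `Number of Dependents` comes from the output above, and the values are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
sample = pd.DataFrame({"Number of Dependents": [0, 0, 0, 0, 0, 0, 1, 2, 3, 0]})

dependents = sample["Number of Dependents"]

# Share of customers with no dependents at all
share_zero = (dependents == 0).mean()

# Mean next to median and max: a few large values pull the mean above the median
summary = dependents.agg(["mean", "median", "max"])

print(f"Share with zero dependents: {share_zero:.0%}")
print(summary)
```

Running the same three lines on `df["Number of Dependents"]` would give the actual share for the full dataset.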

In [11]:
# Number of rows and columns
print("\nNumber of rows and columns:")
print(df.shape)

# Data types of each column
print("\nData types of each column:")
print(df.dtypes)


Number of rows and columns:
(7043, 63)

Data types of each column:
Unnamed: 0                             int64
Customer ID                           object
Count                                  int64
Gender                                object
Age                                    int64
Under 30                              object
Senior Citizen                        object
Married                               object
Dependents                            object
Number of Dependents                   int64
Location ID                           object
Country                               object
State                                 object
City                                  object
Zip Code                               int64
Lat Long                              object
Latitude                             float64
Longitude                            float64
Service ID                            object
Quarter                               object
Referred a Friend               


## Data Exploration Outline for Telco Customer Churn Dataset

### 1. Overview of the Dataset
- Brief description of the dataset
- Number of rows and columns
- Data types of each column

### 2. Summary Statistics
- Descriptive statistics for numerical columns (mean, median, standard deviation, etc.)
- Frequency distribution for categorical columns
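
One possible way to produce both kinds of summary with pandas is sketched below. The frame is a made-up miniature of the combined dataset: the column names appear in the dtype listing above, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the combined dataset (values are made up).
sample = pd.DataFrame({
    "Age": [25, 40, 61, 33, 52],
    "Tenure in Months": [2, 30, 71, 12, 45],
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Married": ["No", "Yes", "Yes", "No", "No"],
})

# Descriptive statistics; on a mixed frame describe() reports numerical columns only
numeric_summary = sample.describe()

# Frequency distribution for every non-numerical column
categorical_cols = sample.select_dtypes(exclude="number").columns
frequencies = {col: sample[col].value_counts() for col in categorical_cols}

print(numeric_summary)
for col, counts in frequencies.items():
    print(f"\n{col}:\n{counts}")
```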

### 3. Missing Values Analysis
- Identify columns with missing values
- Percentage of missing values per column
- Visual representation of missing values (e.g., heatmap)
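
A missing-value report along these lines could be built as follows. This is a sketch on invented data: `Churn Reason` is an assumed column name used only to show a mostly empty column, and the values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample with gaps; column names and values are illustrative.
sample = pd.DataFrame({
    "Age": [25, 40, None, 33],
    "Churn Reason": [None, "Price", None, None],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Count and percentage of missing values per column
missing_count = sample.isna().sum()
missing_pct = sample.isna().mean() * 100

# Keep only columns that actually have gaps, worst first
missing_report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
missing_report = missing_report[missing_report["missing"] > 0]
missing_report = missing_report.sort_values("percent", ascending=False)

print(missing_report)
```

The boolean frame returned by `sample.isna()` is also the natural input for a missing-value heatmap.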

### 4. Distribution of Numerical Features
- Histograms for numerical columns
- Box plots to identify outliers
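
Box plots conventionally flag points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule numerically, which gives the same candidates without a plot. The helper name `iqr_outliers` and the sample values are assumptions made for illustration.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical values: mostly zeros with one extreme entry
sample = pd.DataFrame({"Number of Dependents": [0, 0, 0, 0, 0, 0, 0, 1, 1, 9]})

outliers = iqr_outliers(sample["Number of Dependents"])
print(outliers)
```

For heavily skewed count columns this rule can flag many legitimate values, so the result is a list of candidates to inspect rather than rows to drop.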

### 5. Distribution of Categorical Features
- Bar plots for categorical columns
- Pie charts for categorical columns with few unique values
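
Before plotting, it can help to compute the category proportions and decide which columns have few enough categories for a pie chart. A minimal sketch, assuming a cutoff of three categories (`MAX_PIE_CATEGORIES` is a hypothetical threshold, and the sample values are invented):

```python
import pandas as pd

# Hypothetical sample; column names come from the dtype listing, values are made up.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "Married": ["No", "Yes", "Yes", "No", "No", "Yes"],
    "City": ["Los Angeles", "San Diego", "Fresno", "Los Angeles", "Oakland", "Sacramento"],
})

MAX_PIE_CATEGORIES = 3  # assumed cutoff for "few unique values"

categorical = sample.select_dtypes(exclude="number")

# Columns with few enough categories to remain readable as a pie chart
n_unique = categorical.nunique()
pie_cols = n_unique[n_unique <= MAX_PIE_CATEGORIES].index.tolist()

# Proportion of each category, per column (the data behind a bar plot)
proportions = {col: categorical[col].value_counts(normalize=True) for col in categorical.columns}

print("Pie chart candidates:", pie_cols)
print(proportions["City"])
```

With matplotlib installed, each entry of `proportions` can be drawn with `Series.plot.bar()`, and the columns in `pie_cols` with `Series.plot.pie()`.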

### 6. Correlation Analysis
- Correlation matrix for numerical features
- Heatmap of the correlation matrix
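
The correlation matrix can be computed on the numerical columns alone and then ranked to find the strongest pairs. The following sketch uses invented values, and the ranking step is one possible approach rather than a prescribed method.

```python
import pandas as pd

# Hypothetical sample; values are invented so that two columns move together.
sample = pd.DataFrame({
    "Tenure in Months": [1, 12, 24, 48, 72],
    "Age": [30, 45, 28, 60, 51],
    "Number of Referrals": [0, 1, 2, 4, 6],
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
})

# Pearson correlation (the pandas default) on numerical columns only
numeric = sample.select_dtypes(include="number")
corr = numeric.corr()

# Flatten the matrix and keep each unordered pair once, dropping the diagonal
pairs = corr.unstack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first < second]

# Rank pairs by absolute correlation, strongest first
strongest = pairs.abs().sort_values(ascending=False)

print(corr.round(2))
print(strongest)
```

The `corr` frame is what a heatmap function such as `seaborn.heatmap` would take as input.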

### 7. Churn Analysis
- Distribution of the target variable (Churn)
- Comparison of features between churned and non-churned customers
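
A first pass at the churn analysis might look like the sketch below. It assumes the target is a 0/1 column named `Churn Value`; that name does not appear in the truncated output above, so adjust it to the actual column. The values are invented.

```python
import pandas as pd

# Hypothetical sample; "Churn Value" is an assumed name for the 0/1 target column.
sample = pd.DataFrame({
    "Churn Value": [1, 0, 0, 1, 0, 0, 0, 1],
    "Tenure in Months": [2, 40, 60, 5, 33, 71, 24, 8],
    "Age": [34, 51, 47, 29, 62, 58, 40, 71],
})

# Distribution of the target: the mean of a 0/1 column is the churn rate
churn_rate = sample["Churn Value"].mean()
churn_distribution = sample["Churn Value"].value_counts(normalize=True)

# Average of each numerical feature for churned vs. non-churned customers
by_churn = sample.groupby("Churn Value")[["Tenure in Months", "Age"]].mean()

print(f"Churn rate: {churn_rate:.1%}")
print(by_churn)
```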

### 8. Feature Relationships
- Pair plots to visualize relationships between numerical features
- Grouped bar plots to compare categorical features with the target variable
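
The numbers behind a grouped bar plot can be obtained with a row-normalized cross-tabulation. This sketch again assumes a 0/1 target column named `Churn Value`, and the data is invented.

```python
import pandas as pd

# Hypothetical sample; "Churn Value" is an assumed name for the 0/1 target column.
sample = pd.DataFrame({
    "Married": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"],
    "Churn Value": [1, 1, 0, 0, 0, 0, 0, 1],
})

# Share of churned and non-churned customers within each category (rows sum to 1)
rates = pd.crosstab(sample["Married"], sample["Churn Value"], normalize="index")

print(rates)
```

With matplotlib installed, `rates.plot.bar()` draws one group of bars per category. For the numerical features, `seaborn.pairplot` is a common choice for pair plots.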

### 9. Insights and Observations
- Key findings from the data exploration
- Potential features for modeling
- Any data quality issues identified