# Preprocessing of the Customer Churn Dataset

The dataset `Customer-Churn.csv` is located at `../2_data/original_data/Customer-Churn.csv`. We preprocess the data in the following steps:

1. **Loading the Dataset**:
    - Loading the CSV file into a pandas DataFrame.

2. **Understanding the Data**:
    - Displaying the first few rows of the dataset using `head()`.
    - Using `info()` to get a concise summary of the DataFrame, including the data types and non-null counts.
    - Using `describe()` to get statistical summaries of the numerical columns.

3. **Handling Missing Values**:
    - Identifying columns with missing values using `isnull().sum()`.
    - Deciding on strategies to handle missing values, such as filling with mean/median/mode or dropping rows/columns.

4. **Converting Data Types**:
    - Converting categorical columns to the 'category' data type.
    - Ensuring numerical columns are in the correct format (e.g., integers, floats).

5. **Encoding Categorical Variables**:
    - Using one-hot encoding or label encoding for categorical variables to convert them into numerical format.

6. **Feature Engineering**:
    - Creating new features if necessary, such as aggregating or transforming existing features.
    - Normalizing or standardizing numerical features to bring them to a similar scale.

7. **Handling Outliers**:
    - Identifying outliers using statistical methods or visualization techniques.
    - Deciding on strategies to handle outliers, such as capping or removing them.

8. **Splitting the Data**:
    - Splitting the dataset into training and testing sets to evaluate the model performance.

By following these steps, we can ensure that the dataset is clean, well-structured, and ready for further analysis or modeling.
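As a minimal sketch of how steps 3 to 8 might look in pandas, the block below runs them on a small hypothetical frame. The `toy` values, the median fill, and the two-thirds split are illustrative assumptions, not choices made for the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data (values chosen for illustration).
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, None, 8],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Two year", "Month-to-month"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# Step 3: count missing values per column, then fill the numeric gap with the median.
missing_per_column = toy.isnull().sum()
toy["tenure"] = toy["tenure"].fillna(toy["tenure"].median())

# Step 4: convert the categorical column to the 'category' dtype.
toy["Contract"] = toy["Contract"].astype("category")

# Step 5: one-hot encode the contract type and map the binary target to 0/1.
encoded = pd.get_dummies(toy, columns=["Contract"])
encoded["Churn"] = encoded["Churn"].map({"No": 0, "Yes": 1})

# Step 6: standardize the numeric features (zero mean, unit standard deviation).
numeric_cols = ["tenure", "MonthlyCharges"]
encoded[numeric_cols] = (
    encoded[numeric_cols] - encoded[numeric_cols].mean()
) / encoded[numeric_cols].std()

# Step 8: reproducible split into training and testing rows using pandas only.
train = encoded.sample(frac=2 / 3, random_state=42)
test = encoded.drop(train.index)
```

For brevity the sketch scales before splitting. In a real modeling workflow the scaling parameters would normally be computed on the training rows only and then applied to the test rows.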


In [21]:
import pandas as pd
import os

# 1. **Loading the Dataset**

In [22]:
# Layout of the data
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping

## 1.1 Original Kaggle Dataset

In [23]:
# Load the dataset
file_path = '../2_data/original_data/Customer-Churn.csv'
df_original = pd.read_csv(file_path)

df_original.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### General Information on the Kaggle Dataset

The `df_original` dataframe contains customer information from a telecommunications company. It includes 21 columns with details such as customer ID, gender, senior citizen status, partner status, dependents, tenure, phone service, multiple lines, internet service, online security, online backup, device protection, tech support, streaming TV, streaming movies, contract type, paperless billing, payment method, monthly charges, total charges, and churn status. This dataset is sourced from [Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn).

For more comprehensive data, we found additional datasets in a newer version available on the corresponding [IBM website](https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113).

In [None]:
# Display a concise summary of the DataFrame
df_original.info()

# Get statistical summaries of the numerical columns
df_original.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


### Analysis of the Original Kaggle Dataset

#### Key Insights:

1. **Customer Demographics**:
    - **Gender**: The dataset includes both male and female customers.
    - **Senior Citizens**: Approximately 16.2% of the customers are senior citizens.

2. **Account Information**:
    - **Partner and Dependents**: A significant portion of customers have partners, but fewer have dependents.
    - **Contract Types**: Customers have various contract types, including month-to-month, one-year, and two-year contracts.
    - **Payment Methods**: Customers use different payment methods, such as electronic check, mailed check, bank transfer, and credit card.

3. **Service Usage**:
    - **Phone and Internet Services**: The dataset includes customers with and without phone service, multiple lines, and different types of internet services (DSL, Fiber optic).
    - **Additional Services**: Customers subscribe to various additional services like online security, online backup, device protection, tech support, streaming TV, and streaming movies.

4. **Charges**:
    - **Monthly Charges**: The average monthly charge is approximately $64.76, with a wide range from $18.25 to $118.75.
    - **Total Charges**: The total charges vary significantly among customers, reflecting different service usage and tenure.

5. **Customer Tenure**:
    - **Tenure**: The tenure of customers ranges from 0 to 72 months, with a median of 29 months. This indicates a diverse range of customer loyalty.

6. **Churn**:
    - **Churn Rate**: The dataset includes information on whether customers have churned, which is crucial for predictive modeling and understanding customer retention.

### Summary Statistics:

- **SeniorCitizen**:
  - Mean: 0.162
  - Standard Deviation: 0.369
  - Min: 0
  - Max: 1

- **Tenure**:
  - Mean: 32.37 months
  - Standard Deviation: 24.56 months
  - Min: 0 months
  - Max: 72 months

- **MonthlyCharges**:
  - Mean: $64.76
  - Standard Deviation: $30.09
  - Min: $18.25
  - Max: $118.75

These insights provide a comprehensive overview of the customer demographics, account information, service usage, charges, tenure, and churn, which are essential for further analysis and modeling.
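The qualitative statements about partners, dependents, contract types, and churn can be quantified with `value_counts(normalize=True)`. The sketch below uses only the five preview rows of `df_original` shown earlier, so the resulting shares are not representative of the full dataset; on the real data the same calls would be applied to `df_original`.

```python
import pandas as pd

# The five preview rows of df_original, reduced to four categorical columns.
preview = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "No", "No"],
    "Dependents": ["No", "No", "No", "No", "No"],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

# Relative frequency of each category, per column.
shares = {col: preview[col].value_counts(normalize=True) for col in preview.columns}

# Churn rate: fraction of rows whose Churn value is "Yes".
churn_rate = (preview["Churn"] == "Yes").mean()

for col, share in shares.items():
    print(col, share.round(2).to_dict())
print("churn rate:", churn_rate)
```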

### Conclusion on the Original Kaggle Dataset and Next Steps

There is a newer version of the dataset available on the IBM website, which provides more comprehensive and detailed data. We will focus on this dataset for our analysis and modeling.

## 1.2 IBM Dataset

The `../2_data/Telecommunications_Industry` folder contains several Excel files sourced from the IBM website. Each file covers one aspect of the customer data. Below is a brief description of each file:

**Telco_customer_churn_demographics.xlsx**

This file contains demographic information about the customers, such as age, gender, and marital status. This data is useful for understanding the customer base and performing market segmentation.


**Telco_customer_churn_location.xlsx**

This file includes data on the geographical location of customers, including their state, city, and zip code. It helps in analyzing the distribution of customers across different regions.

**Telco_customer_churn_population.xlsx**

This file provides details on the population statistics related to the customers, such as population density and other relevant metrics. It helps in understanding the broader demographic context of the customer base.

**Telco_customer_churn_services.xlsx**

This file includes data on the various services subscribed to by customers, such as phone service, internet service, and additional features like online security and tech support. It helps in analyzing service usage patterns and customer preferences.

**Telco_customer_churn_status.xlsx**

This file captures the status of customers, including their tenure, contract type, payment method, and churn status. It is essential for understanding customer retention and identifying factors contributing to churn.

Each of these files provides valuable insights into different aspects of the telecommunications industry, enabling data-driven decision-making and strategic planning.


## 1.2.1 IBM Demographics Dataset

In [28]:
# Load the demographics dataset
demographics_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn_demographics.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_demographics.csv'
df_demographics = pd.read_excel(demographics_file_path)

# Save the demographics dataset to a CSV file
df_demographics.to_csv(csv_file_path)

# Load the demographics dataset from the CSV file
df_demographics = pd.read_csv(csv_file_path)

# Display the first few rows of the demographics dataset
df_demographics.head()

Unnamed: 0.1,Unnamed: 0,Customer ID,Count,Gender,Age,Under 30,Senior Citizen,Married,Dependents,Number of Dependents
0,0,8779-QRDMV,1,Male,78,No,Yes,No,No,0
1,1,7495-OOKFY,1,Female,74,No,Yes,Yes,Yes,1
2,2,1658-BYGOY,1,Male,71,No,Yes,No,Yes,3
3,3,4598-XLKNJ,1,Female,78,No,Yes,Yes,Yes,1
4,4,4846-WHAFZ,1,Female,80,No,Yes,Yes,Yes,1


### General Information on the IBM Demographics Dataset

The `Telco_customer_churn_demographics.xlsx` dataset, referred to as `df_demographics` in the rest of the analysis, provides demographic information about the customers of a telecommunications company. The summary below lists its columns and data types.

In [36]:
# Display a concise summary of the demographics DataFrame
df_demographics.info()

# Get statistical summaries of the numerical columns in the demographics DataFrame
df_demographics.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Unnamed: 0            7043 non-null   int64 
 1   Customer ID           7043 non-null   object
 2   Count                 7043 non-null   int64 
 3   Gender                7043 non-null   object
 4   Age                   7043 non-null   int64 
 5   Under 30              7043 non-null   object
 6   Senior Citizen        7043 non-null   object
 7   Married               7043 non-null   object
 8   Dependents            7043 non-null   object
 9   Number of Dependents  7043 non-null   int64 
dtypes: int64(4), object(6)
memory usage: 550.4+ KB


Unnamed: 0.1,Unnamed: 0,Count,Age,Number of Dependents
count,7043.0,7043.0,7043.0,7043.0
mean,3521.0,1.0,46.509726,0.468692
std,2033.283305,0.0,16.750352,0.962802
min,0.0,1.0,19.0,0.0
25%,1760.5,1.0,32.0,0.0
50%,3521.0,1.0,46.0,0.0
75%,5281.5,1.0,60.0,0.0
max,7042.0,1.0,80.0,9.0


### Key Insights from Demographics Data

Our goal is to get a general understanding of this dataset. We will analyze it in detail later.

1. **Gender Distribution**:
    - The dataset includes both male and female customers. The exact split is not visible in the summary above and can be checked with `value_counts()`.

2. **Age Distribution**:
    - The average age of customers is approximately 46.5 years.
    - The age range spans from 19 to 80 years, indicating a diverse age group among the customers.
    - The standard deviation of 16.75 years suggests a wide variation in customer ages.

3. **Senior Citizens**:
    - The dataset includes information on whether customers are senior citizens (65 years or older). This can help in understanding the needs and preferences of older customers.

4. **Marital Status**:
    - The dataset captures whether customers are married, which can be useful for segmenting customers based on their family status.

5. **Dependents**:
    - Information on whether customers have dependents and the number of dependents they have is available.
    - The average number of dependents is approximately 0.47, with a maximum of 9 dependents.
    - At least 75% of customers have no dependents, since the 25th, 50th, and 75th percentiles are all 0.

6. **Customer Count**:
    - The `Count` column is used for reporting and dashboarding purposes, indicating the number of customers in a filtered set.

These insights provide a comprehensive overview of the customer demographics, which can be crucial for targeted marketing, customer segmentation, and personalized service offerings.
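The `info()` output for `df_demographics` lists an `Unnamed: 0` column. This is the DataFrame index, which `to_csv` writes by default and `read_csv` then reads back as an ordinary column. A small sketch of the effect and of the `index=False` option follows; it uses an in-memory buffer and two preview rows instead of the real files.

```python
import io

import pandas as pd

# Two preview rows of the demographics data.
frame = pd.DataFrame({"Customer ID": ["8779-QRDMV", "7495-OOKFY"], "Age": [78, 74]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Passing `index=False` when saving, or dropping the column after loading, keeps the index out of later statistics such as `describe()`.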

## 1.2.2 IBM Location Dataset

### General Information on the IBM Location Dataset

The `Telco_customer_churn_location.xlsx` dataset, referred to as `df_location` in the rest of the analysis, provides location information about the customers of a telecommunications company. This data helps us understand the geographical distribution of customers, which matters for regional marketing strategies, service optimization, and identifying location-specific customer needs. From it we can derive customer concentration by area, regional preferences, and candidate areas for expanding services or improving customer support.

In [37]:
# Load the location dataset
location_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn_location.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_location.csv'
df_location = pd.read_excel(location_file_path)

# Save the location dataset to a CSV file
df_location.to_csv(csv_file_path)

# Load the location dataset from the CSV file
df_location = pd.read_csv(csv_file_path)

# Display the first few rows of the location dataset
df_location.head()

Unnamed: 0.1,Unnamed: 0,Location ID,Customer ID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude
0,0,OXCZEW7397,8779-QRDMV,1,United States,California,Los Angeles,90022,"34.02381, -118.156582",34.02381,-118.156582
1,1,FCCECI8494,7495-OOKFY,1,United States,California,Los Angeles,90063,"34.044271, -118.185237",34.044271,-118.185237
2,2,HEHUQY7254,1658-BYGOY,1,United States,California,Los Angeles,90065,"34.108833, -118.229715",34.108833,-118.229715
3,3,WIUHRF2613,4598-XLKNJ,1,United States,California,Inglewood,90303,"33.936291, -118.332639",33.936291,-118.332639
4,4,CFEZBF4415,4846-WHAFZ,1,United States,California,Whittier,90602,"33.972119, -118.020188",33.972119,-118.020188


In [38]:
# Display a concise summary of the location DataFrame
df_location.info()

# Get statistical summaries of the numerical columns in the location DataFrame
df_location.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   7043 non-null   int64  
 1   Location ID  7043 non-null   object 
 2   Customer ID  7043 non-null   object 
 3   Count        7043 non-null   int64  
 4   Country      7043 non-null   object 
 5   State        7043 non-null   object 
 6   City         7043 non-null   object 
 7   Zip Code     7043 non-null   int64  
 8   Lat Long     7043 non-null   object 
 9   Latitude     7043 non-null   float64
 10  Longitude    7043 non-null   float64
dtypes: float64(2), int64(3), object(6)
memory usage: 605.4+ KB


Unnamed: 0.1,Unnamed: 0,Count,Zip Code,Latitude,Longitude
count,7043.0,7043.0,7043.0,7043.0,7043.0
mean,3521.0,1.0,93486.071134,36.197455,-119.756684
std,2033.283305,0.0,1856.768045,2.468929,2.154425
min,0.0,1.0,90001.0,32.555828,-124.301372
25%,1760.5,1.0,92101.0,33.990646,-121.78809
50%,3521.0,1.0,93518.0,36.205465,-119.595293
75%,5281.5,1.0,95329.0,38.161321,-117.969795
max,7042.0,1.0,96150.0,41.962127,-114.192901


### Key Insights from Location Data

1. **Geographical Distribution**:
    - The zip codes range from 90001 to 96150, and the coordinates span roughly 32.6 to 42.0 degrees latitude and -124.3 to -114.2 degrees longitude. These ranges are consistent with a customer base located in California.
    - Within that area, customers are spread widely from south to north and from west to east.

2. **Customer Concentration**:
    - `Zip Code` is stored as `int64`, but it is a categorical label, so its mean, median, and standard deviation carry no meaning.
    - Customer concentration is better assessed by counting customers per city or per zip code.

3. **State and City Representation**:
    - All five preview rows are in California. Whether `Country` and `State` hold more than one value should be confirmed with `nunique()` before planning any state-level analysis.
    - The `City` column varies across rows (Los Angeles, Inglewood, and Whittier in the preview) and is a natural grouping key for regional analysis.

4. **Latitude and Longitude Range**:
    - `Latitude` and `Longitude` repeat the information in the combined `Lat Long` text column, so `Lat Long` is likely redundant.
    - The numeric coordinates can be used to map customers or to analyze regional differences in customer behavior and preferences.

5. **Data Completeness**:
    - All columns have complete data with no missing values, ensuring the reliability of the location information for further analysis.

These insights highlight the geographical diversity and distribution of the customer base, which can be crucial for regional marketing strategies, service optimization, and understanding location-specific customer needs.
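Because zip codes are labels rather than quantities, counts per category say more about customer concentration than a mean or a standard deviation. The following sketch applies that idea to the five preview rows of `df_location`; on the real data the same calls would run on the full frame.

```python
import pandas as pd

# The five preview rows of df_location, reduced to the columns used here.
preview = pd.DataFrame({
    "State": ["California"] * 5,
    "City": ["Los Angeles", "Los Angeles", "Los Angeles", "Inglewood", "Whittier"],
    "Zip Code": [90022, 90063, 90065, 90303, 90602],
})

# Zip codes are labels, not quantities, so store them as strings.
preview["Zip Code"] = preview["Zip Code"].astype(str)

# Distinct values per column show how many states, cities, and zip codes occur.
distinct_counts = preview[["State", "City", "Zip Code"]].nunique()

# Customers per city as a simple concentration measure.
customers_per_city = preview["City"].value_counts()

print(distinct_counts.to_dict())
print(customers_per_city.to_dict())
```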

## 1.2.3 IBM Population Dataset

### General Information on the IBM Population Dataset

The `Telco_customer_churn_population.xlsx` dataset, referred to as `df_population` in the rest of the analysis, provides population information for the areas in which the customers live. It adds broader demographic context to the customer base, which is useful for regional analysis and for identifying trends related to population size.

In [39]:
# Load the population dataset
population_file_path = '../2_data/original_data/Telecommunications_Industry/Telco_customer_churn_population.xlsx'
csv_file_path = '../2_data/Telco_customer_churn_population.csv'
df_population = pd.read_excel(population_file_path)

# Save the population dataset to a CSV file
df_population.to_csv(csv_file_path)

# Load the population dataset from the CSV file
df_population = pd.read_csv(csv_file_path)

# Display the first few rows of the population dataset
df_population.head()

Unnamed: 0.1,Unnamed: 0,ID,Zip Code,Population
0,0,1,90001,54492
1,1,2,90002,44586
2,2,3,90003,58198
3,3,4,90004,67852
4,4,5,90005,43019


### Key Insights from Population Data

The preview shows three data columns: `ID`, `Zip Code`, and `Population`. Each row gives the population of one zip code, so this table describes areas rather than individual customers. It has no `Customer ID` column; `Zip Code` is the key it shares with `df_location`. The `Unnamed: 0` column is the index that was written to the CSV file and carries no information.
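The population preview lists one row per zip code, so a natural way to use it is to attach the population to each customer through the `Zip Code` column it shares with `df_location`. The sketch below assumes that join strategy. The population rows come from the preview, while the customer rows (`A-001` and so on) are hypothetical.

```python
import pandas as pd

# Population preview rows from the output above.
population = pd.DataFrame({
    "Zip Code": [90001, 90002, 90003, 90004, 90005],
    "Population": [54492, 44586, 58198, 67852, 43019],
})

# Hypothetical customers; the IDs and zip assignments are invented for illustration.
customers = pd.DataFrame({
    "Customer ID": ["A-001", "A-002", "A-003"],
    "Zip Code": [90001, 90003, 99999],
})

# Left join keeps every customer; validate raises if a zip code appears twice
# in the population table.
enriched = customers.merge(
    population, on="Zip Code", how="left", validate="many_to_one"
)

# Customers whose zip code has no population entry end up with NaN.
unmatched = enriched["Population"].isnull().sum()
print(enriched)
```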

## 1.2.4 IBM Service Dataset


## 1.2.5 IBM Status Dataset


## 1.2.6 IBM CustomerChurn Dataset


## 1.2.7 IBM Telco_Customer_churn Dataset


### Dataset Overview

The dataset `df` contains customer information from a telecommunications company. Here are some key observations based on the dataset head:

#### Data Structure
- **Number of Columns**: The dataset has 66 columns, indicating a comprehensive set of features related to customer demographics, account information, and service usage.
- **Number of Rows**: The dataset has 43,929 rows. Because the source files were stacked row-wise, this is the sum of the row counts of all files, not the number of distinct customers.

#### Data Quality
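- **Missing Values**: Nearly every column has missing values. This is largely a side effect of `pd.concat` stacking files with different columns, so each row is populated only in the columns of its source file. For example: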
    - `ID`, `Zip Code`, and `Population` have a significant number of missing values.
    - `CustomerID`, `Count`, `Country`, `State`, `City`, `Lat Long`, `Latitude`, `Longitude`, `Gender`, `Senior Citizen`, `Partner`, `Dependents`, `Phone Service`, `Multiple Lines`, `Internet Service`, `Online Security`, `Online Backup`, `Device Protection`, `Tech Support`, `Streaming TV`, `Streaming Movies`, `Contract`, `Paperless Billing`, `Payment Method`, `Monthly Charges`, `Total Charges`, `Churn Label`, `Churn Value`, `Churn Score`, `CLTV`, `Churn Reason`, `Customer ID`, `Age`, `Under 30`, `Married`, `Number of Dependents`, `Service ID`, `Quarter`, `Referred a Friend`, `Number of Referrals`, `Tenure in Months`, `Offer`, `Avg Monthly Long Distance Charges`, `Internet Type`, `Avg Monthly GB Download`, `Device Protection Plan`, `Premium Tech Support`, `Streaming Music`, `Unlimited Data`, `Monthly Charge`, `Total Refunds`, `Total Extra Data Charges`, `Total Long Distance Charges`, `Total Revenue`, `Status ID`, `Satisfaction Score`, `Customer Status`, `Churn Category`, `Location ID`, `LoyaltyID`, `Tenure`, and `Churn` also have missing values.
- **Data Types**: The dataset contains a mix of numerical and categorical data types. Some columns, like `Total Charges`, are stored as objects but likely contain numerical data, indicating a need for data type conversion.

#### Conclusion
Given the presence of missing values and the need for data type conversions, the quality of the data requires improvement through preprocessing steps such as handling missing values, converting data types, and encoding categorical variables.

Additionally, there is another dataset in the telecommunications industry (`dfs`) that might contain more comprehensive and detailed data. This dataset is split into multiple dataframes, each focusing on different aspects of customer information. Combining and analyzing these datasets can provide a more holistic view of customer behavior and improve the quality of insights derived from the data.



In [None]:



# Directory containing the Excel files
directory_path = '../2_data/Telecommunications_Industry'

# List to hold dataframes
dfs = []

# Iterate over all files in the directory
for filename in os.listdir(directory_path):
    if filename.endswith('.xlsx'):
        file_path = os.path.join(directory_path, filename)
        dfs.append(pd.read_excel(file_path))

# Concatenate all dataframes row-wise (columns missing from a file are filled with NaN)
df = pd.concat(dfs, ignore_index=True)
df.head()

Unnamed: 0,ID,Zip Code,Population,CustomerID,Count,Country,State,City,Lat Long,Latitude,...,Total Long Distance Charges,Total Revenue,Status ID,Satisfaction Score,Customer Status,Churn Category,Location ID,LoyaltyID,Tenure,Churn
0,1.0,90001.0,54492.0,,,,,,,,...,,,,,,,,,,
1,2.0,90002.0,44586.0,,,,,,,,...,,,,,,,,,,
2,3.0,90003.0,58198.0,,,,,,,,...,,,,,,,,,,
3,4.0,90004.0,67852.0,,,,,,,,...,,,,,,,,,,
4,5.0,90005.0,43019.0,,,,,,,,...,,,,,,,,,,
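The preview above shows rows that are populated only in the columns of their source file, which is what a row-wise `pd.concat` produces when the frames have different columns. An alternative, sketched below under the assumption that the customer-level files share the `Customer ID` key (as the demographics and location previews do), is to join them column-wise. The rows are taken from those two previews.

```python
import pandas as pd

# Preview rows from the demographics and location files.
demographics = pd.DataFrame({
    "Customer ID": ["8779-QRDMV", "7495-OOKFY", "1658-BYGOY"],
    "Age": [78, 74, 71],
})
location = pd.DataFrame({
    "Customer ID": ["8779-QRDMV", "7495-OOKFY", "1658-BYGOY"],
    "City": ["Los Angeles", "Los Angeles", "Los Angeles"],
    "Zip Code": [90022, 90063, 90065],
})

# Row-wise stacking: six rows, each with NaN in the other file's columns.
stacked = pd.concat([demographics, location], ignore_index=True)

# Key-based join: one row per customer, all columns filled.
joined = demographics.merge(
    location, on="Customer ID", how="inner", validate="one_to_one"
)

print(stacked.shape, int(stacked.isnull().any(axis=1).sum()))
print(joined.shape, int(joined.isnull().any(axis=1).sum()))
```

Files without a `Customer ID` column, such as the population table, would need a different key and are not covered by this sketch.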


In [None]:
# Print rows with NaN values
rows_with_nan = df[df.isnull().any(axis=1)]
print("Rows with NaN values:")
print(rows_with_nan)

# Load the Customer-Churn.csv to get the customer IDs
customer_churn_df = pd.read_csv('../2_data/Customer-Churn.csv')

# Find overlapping rows based on customerID
overlapping_rows = df[df['CustomerID'].isin(customer_churn_df['customerID'])]
print("\nOverlapping rows with Customer-Churn.csv based on customerID:")
print(overlapping_rows)

Rows with NaN values:
        ID  Zip Code  Population CustomerID  Count Country State City  \
0      1.0   90001.0     54492.0        NaN    NaN     NaN   NaN  NaN   
1      2.0   90002.0     44586.0        NaN    NaN     NaN   NaN  NaN   
2      3.0   90003.0     58198.0        NaN    NaN     NaN   NaN  NaN   
3      4.0   90004.0     67852.0        NaN    NaN     NaN   NaN  NaN   
4      5.0   90005.0     43019.0        NaN    NaN     NaN   NaN  NaN   
...    ...       ...         ...        ...    ...     ...   ...  ...   
43924  NaN       NaN         NaN        NaN    NaN     NaN   NaN  NaN   
43925  NaN       NaN         NaN        NaN    NaN     NaN   NaN  NaN   
43926  NaN       NaN         NaN        NaN    NaN     NaN   NaN  NaN   
43927  NaN       NaN         NaN        NaN    NaN     NaN   NaN  NaN   
43928  NaN       NaN         NaN        NaN    NaN     NaN   NaN  NaN   

      Lat Long  Latitude  ...  Total Long Distance Charges Total Revenue  \
0          NaN       NaN 

## 2. **Understanding the Data**

In [None]:
# table overview of the dataset
df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_group,TotalChargesPerMonth
0,7590-VHVEG,Female,0,Yes,No,-1.277445,No,No phone service,DSL,No,...,No,No,Month-to-month,Yes,Electronic check,-1.160323,-0.994242,No,0-12,-1.151302
1,5575-GNVDE,Male,0,No,No,0.066327,Yes,No,DSL,Yes,...,No,No,One year,No,Mailed check,-0.259629,-0.173244,No,24-48,-0.301458
2,3668-QPYBK,Male,0,No,No,-1.236724,Yes,No,DSL,Yes,...,No,No,Month-to-month,Yes,Mailed check,-0.362660,-0.959674,Yes,0-12,-0.350966
3,7795-CFOCW,Male,0,No,No,0.514251,No,No phone service,DSL,Yes,...,No,No,One year,No,Bank transfer (automatic),-0.746535,-0.194766,No,24-48,-0.786053
4,9237-HQITU,Female,0,No,No,-1.236724,Yes,No,Fiber optic,No,...,No,No,Month-to-month,Yes,Electronic check,0.197365,-0.940470,Yes,0-12,0.367602
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,-0.340876,Yes,Yes,DSL,Yes,...,Yes,Yes,One year,Yes,Mailed check,0.665992,-0.128655,No,12-24,0.602583
7039,2234-XADUH,Female,0,Yes,Yes,1.613701,Yes,Yes,Fiber optic,No,...,Yes,Yes,One year,Yes,Credit card (automatic),1.277533,2.243151,No,60+,1.241035
7040,4801-JZAZL,Female,0,Yes,Yes,-0.870241,No,No phone service,DSL,Yes,...,No,No,Month-to-month,Yes,Electronic check,-1.168632,-0.854469,No,0-12,-1.096940
7041,8361-LTMKD,Male,1,Yes,No,-1.155283,Yes,Yes,Fiber optic,No,...,No,No,Month-to-month,Yes,Mailed check,0.320338,-0.872062,Yes,0-12,0.394858
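The table above contains a `tenure_group` column with labels such as `0-12`, `12-24`, `24-48`, and `60+`. The cell that created it is not shown here. One way such groups could be derived from the raw tenure in months is `pd.cut`, as in the sketch below. The bin edges and the `48-60` label are assumptions inferred from the visible labels.

```python
import pandas as pd

# Raw tenure values in months (before any scaling).
tenure = pd.Series([0, 1, 24, 34, 72], name="tenure")

# Assumed bin edges; intervals are closed on the right, and include_lowest
# keeps a tenure of 0 in the first group.
bins = [0, 12, 24, 48, 60, float("inf")]
labels = ["0-12", "12-24", "24-48", "48-60", "60+"]

tenure_group = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)
print(tenure_group.tolist())
```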


In [None]:
# Display the first few rows of the dataset
print(df.head())

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

### Initial Exploration of the Dataset

Upon examining the first few rows of the dataset using the `head()` method, we observe the following:

- **customerID**: Unique identifier for each customer.
- **gender**: Gender of the customer (Male/Female).
- **SeniorCitizen**: Indicates if the customer is a senior citizen (0: No, 1: Yes).
- **Partner**: Indicates if the customer has a partner (Yes/No).
- **Dependents**: Indicates if the customer has dependents (Yes/No).
- **tenure**: Number of months the customer has stayed with the company.
- **PhoneService**: Indicates if the customer has phone service (Yes/No).
- **MultipleLines**: Indicates if the customer has multiple lines (Yes/No/No phone service).
- **InternetService**: Type of internet service (DSL/Fiber optic/No).
- **OnlineSecurity**: Indicates if the customer has online security (Yes/No/No internet service).
- **OnlineBackup**: Indicates if the customer has online backup (Yes/No/No internet service).
- **DeviceProtection**: Indicates if the customer has device protection (Yes/No/No internet service).
- **TechSupport**: Indicates if the customer has tech support (Yes/No/No internet service).
- **StreamingTV**: Indicates if the customer has streaming TV (Yes/No/No internet service).
- **StreamingMovies**: Indicates if the customer has streaming movies (Yes/No/No internet service).
- **Contract**: Type of contract (Month-to-month/One year/Two year).
- **PaperlessBilling**: Indicates if the customer has paperless billing (Yes/No).
- **PaymentMethod**: Payment method used by the customer (Electronic check/Mailed check/Bank transfer (automatic)/Credit card (automatic)).
- **MonthlyCharges**: Monthly charges incurred by the customer.
- **TotalCharges**: Total charges incurred by the customer.
- **Churn**: Indicates if the customer has churned (Yes/No).

### Noteworthy Observations

- **TotalCharges**: This column is of type `object` instead of `float64`, which suggests that there might be some non-numeric values or missing values that need to be handled.
- **SeniorCitizen**: This column is of type `int64` but represents a binary categorical variable, which might be better represented as a category.
- **Churn**: This is the target variable indicating whether the customer has churned, which will be crucial for any predictive modeling.

These observations will guide the preprocessing steps, such as handling missing values, converting data types, and encoding categorical variables.
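
One way to check observations like these is to list the distinct values of each column and test whether an integer column is really a binary flag. The sketch below uses a small hypothetical frame named `sample` in place of the real `df`; its rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the churn data (values invented for illustration)
sample = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0, 0],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
})

# Distinct values of a text column reveal its category levels
contract_levels = sorted(sample["Contract"].unique())

# An integer column holding only 0 and 1 is a binary flag rather than a true quantity
senior_is_binary = set(sample["SeniorCitizen"].unique()) <= {0, 1}

print(contract_levels)
print(f"SeniorCitizen is binary: {senior_is_binary}")
```

On the real data, the same two checks could be run over every column to decide which ones to treat as categorical.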

In [None]:
# concise summary of the DataFrame
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   customerID            7043 non-null   object  
 1   gender                7043 non-null   category
 2   SeniorCitizen         7043 non-null   category
 3   Partner               7043 non-null   category
 4   Dependents            7043 non-null   category
 5   tenure                7043 non-null   float64 
 6   PhoneService          7043 non-null   category
 7   MultipleLines         7043 non-null   category
 8   InternetService       7043 non-null   category
 9   OnlineSecurity        7043 non-null   category
 10  OnlineBackup          7043 non-null   category
 11  DeviceProtection      7043 non-null   category
 12  TechSupport           7043 non-null   category
 13  StreamingTV           7043 non-null   category
 14  StreamingMovies       7043 non-null   category
 15  Cont

Rewrite this part after carrying out the analysis ideas listed below, then evaluate the results.

### Data Types and Conversion

**TotalCharges**: This column is of type object but likely contains numerical data. Investigate and convert it to a numeric type if appropriate, handling any non-numeric values.

**Categorical Variables**: Check the cardinality of each categorical column. High-cardinality columns like customerID have a unique value for each entry and might not be useful for analysis.

**Binary Categories**: Columns like gender, Partner, Dependents, PhoneService, PaperlessBilling, and Churn are binary and can be easily analyzed for distribution and correlation.

**Numerical Variables**: SeniorCitizen, tenure, MonthlyCharges. Analyze the distribution, central tendency, and dispersion. Look for outliers and patterns.

**Correlation Analysis**: Check how these numerical variables correlate with each other and with the target variable Churn.

**Missing Values**: Although info() shows no missing values, ensure there are no hidden missing values (e.g., empty strings or placeholders) in columns like TotalCharges.

**Distribution Analysis**:
- Histograms and Box Plots: For numerical columns, to understand their distribution and identify outliers.
- Bar Plots: For categorical columns, to see the frequency distribution of each category.

**Relationships and Patterns**: 
- Pair Plots: To visualize relationships between numerical variables.
- Group By Analysis: Group data by categorical variables (e.g., Contract, PaymentMethod) to see how they affect numerical variables and Churn.


**Target Variable Analysis**:
- Churn: Analyze the distribution of the target variable Churn to understand the class balance. This is crucial for model building.

**Feature Engineering**: New Features: Create new features that might capture more information, such as tenure groups or interaction terms between categorical variables.
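
Two of the ideas above, the class balance of the target and a group-by analysis, can be sketched in a few lines. The frame `toy` below is hypothetical and its values are invented; the real analysis would run on the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy rows (invented values); the real analysis would use df
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "Two year"],
    "MonthlyCharges": [70.0, 80.0, 55.0, 25.0, 90.0, 30.0],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Class balance of the target, expressed as proportions
class_balance = toy["Churn"].value_counts(normalize=True)

# Churn rate and mean monthly charge per contract type
by_contract = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby("Contract")
       .agg(churn_rate=("churned", "mean"), mean_monthly=("MonthlyCharges", "mean"))
)

print(class_balance)
print(by_contract)
```

The mean of a boolean column is the share of `True` values, which is why `churn_rate` can be computed with a plain `mean`.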

In [None]:
# Get statistical summaries of the numerical columns
print(df.describe())

             tenure  MonthlyCharges  TotalCharges  TotalChargesPerMonth
count  7.043000e+03    7.043000e+03  7.043000e+03          7.043000e+03
mean  -2.421273e-17   -6.406285e-17 -1.488074e-17          7.465592e-17
std    1.000071e+00    1.000071e+00  1.000071e+00          1.000071e+00
min   -1.318165e+00   -1.545860e+00 -9.991203e-01         -2.137475e+00
25%   -9.516817e-01   -9.725399e-01 -8.298459e-01         -9.597171e-01
50%   -1.372744e-01    1.857327e-01 -3.904632e-01          1.850696e-01
75%    9.214551e-01    8.338335e-01  6.642871e-01          8.416645e-01
max    1.613701e+00    1.794352e+00  2.826743e+00          1.873292e+00


Rewrite the interpretation below using the statistics from the cell output above.

These figures describe the raw, unscaled columns; the `describe()` output above was captured after standardization, so its values are centered on 0 and it omits `SeniorCitizen`.

- **SeniorCitizen**: The mean is 0.162 and the 25th, 50th, and 75th percentiles are all 0, so only about 16.2% of the customers are senior citizens. This low proportion is the most striking observation.
- **tenure**: Tenure ranges from 0 to 72 months with a standard deviation of 24.56 months. The median is 29 months, so half of the customers have been with the company for 29 months or less. The wide range and high standard deviation suggest diverse customer retention.
- **MonthlyCharges**: The average monthly charge is $64.76 with a standard deviation of $30.09, and charges range from $18.25 to $118.75. The median is $70.35, so half of the customers pay this amount or less. This variability points to a diverse customer base with different service levels or packages.
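
The reading of `SeniorCitizen` relies on a small piece of arithmetic: the mean of a 0/1 column equals the share of ones, and when that share is below 25% the three quartiles are all 0. A minimal sketch with an invented ten-value series (not the real column):

```python
import pandas as pd

# Invented 0/1 flags: 2 of 10 customers are marked as senior citizens
senior = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 1, 0])

share = senior.mean()                             # proportion of ones
quartiles = senior.quantile([0.25, 0.5, 0.75])    # all zero when fewer than 25% are ones

print(f"Share of senior citizens: {share:.1%}")
print(quartiles)
```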



## 3. **Handle Missing Values**

In [None]:
# Identify columns with missing values
missing_values = df.isnull().sum()
print(missing_values)

# Since the dataset summary shows no missing values, let's check for any hidden missing values in 'TotalCharges'
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Now check again for missing values after conversion
missing_values_after_conversion = df.isnull().sum()
print(missing_values_after_conversion)

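# Handle missing values in 'TotalCharges' by filling with the median
# (assign the result back; chained inplace=True is deprecated and will stop working in pandas 3.0)
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())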

# Verify that there are no more missing values
final_missing_values = df.isnull().sum()
print(final_missing_values)

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
customerID          0
gender 



### Evaluation of Missing Values

The output of the missing values analysis reveals the following:

**Initial Check**: The first `isnull().sum()` call reports no missing values in any column, including `TotalCharges`.

**After Conversion**: After converting `TotalCharges` to numeric with `errors='coerce'`, 11 missing values appear. These entries were non-numeric text rather than nulls, so the initial check could not see them, and the conversion turned them into `NaN`.

**Final Check**: After filling these 11 values with the median of `TotalCharges`, no missing values remain, so the data is clean and ready for further analysis.
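
Before imputing, it can help to look at the rows that fail the numeric conversion. The sketch below shows one way to isolate them, using a hypothetical frame `raw` with one blank entry. In this invented data the blank row has `tenure` 0; whether the same pattern holds in the real file is something to verify, not something the sketch establishes.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as text, with one blank entry
raw = pd.DataFrame({
    "tenure": [1, 34, 0, 45],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],
})

numeric = pd.to_numeric(raw["TotalCharges"], errors="coerce")

# Rows that were not null as text but became NaN after coercion
hidden_missing = raw[numeric.isna() & raw["TotalCharges"].notna()]

print(hidden_missing)
```

If the affected rows share a property (such as a tenure of 0), that may suggest a more specific fill value than the column median.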

The Telco Customer Churn dataset contains information about a telecommunications company's customers and their subscription details. The dataset includes various features such as customer demographics, account information, and services subscribed. One of the columns, `TotalCharges`, represents the total amount charged to the customer.

#### Why Use the Median to Fill NaN Values in `TotalCharges`?

1. **Robustness to Outliers**:
   - The median is less affected by outliers and extreme values compared to the mean. In the context of `TotalCharges`, there could be customers with unusually high or low charges due to specific circumstances (e.g., promotional discounts, billing errors). Using the median ensures that these outliers do not skew the imputed values.

2. **Skewed Distribution**:
   - Financial data, such as `TotalCharges`, often has a skewed distribution. The mean can be heavily influenced by the skewness, leading to imputed values that do not represent the central tendency of the majority of the data. The median, being the middle value, provides a better central measure for skewed distributions.

3. **Consistency with Customer Behavior**:
   - The median represents the typical customer more accurately in many cases. For example, if most customers have moderate charges with a few having very high or very low charges, the median will reflect the typical customer’s charges better than the mean.

4. **Avoiding Extremes**:
   - Using the minimum or maximum values to fill NaN values would introduce extreme values that are not representative of the typical customer. This could distort analyses and models built on the data.

#### Example Scenario in the Telco Customer Churn Dataset

- **Mean**: If the dataset has a few customers with very high `TotalCharges` due to long tenure or high service usage, the mean would be higher than the typical customer’s charges. Imputing NaN values with the mean could overestimate the charges for customers with missing values.
- **Median**: The median provides a central value that is not influenced by the extreme high or low values, making it a more reasonable choice for imputation.
- **Min/Max**: Imputing with the minimum or maximum would introduce values that are not typical for most customers, leading to potential biases in the analysis.

Using the median to fill NaN values in the `TotalCharges` column of the Telco Customer Churn dataset is a reasonable approach because it provides a robust measure of central tendency that is not influenced by outliers or skewed distributions. This ensures that the imputed values are representative of the typical customer, leading to more accurate and reliable analyses and models.
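
The difference between mean and median imputation is easy to see on a small right-skewed sample. The values below are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Invented right-skewed charges with one missing value
charges = pd.Series([100.0, 200.0, 300.0, 400.0, 8000.0, np.nan])

mean_fill = charges.fillna(charges.mean())      # pulled upward by the 8000 entry
median_fill = charges.fillna(charges.median())  # stays at the middle value

print(f"Mean-imputed value:   {mean_fill.iloc[-1]}")
print(f"Median-imputed value: {median_fill.iloc[-1]}")
```

Here the mean is 1800 while the median is 300, so the mean-imputed value is larger than four of the five observed charges.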

## 4. **Convert Data Types**

In [None]:
# Convert categorical columns to 'category' data type
categorical_columns = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 
                       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 
                       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']

for col in categorical_columns:
    df[col] = df[col].astype('category')

# Ensure numerical columns are in the correct format
df['SeniorCitizen'] = df['SeniorCitizen'].astype('int64')
df['tenure'] = df['tenure'].astype('int64')
df['MonthlyCharges'] = df['MonthlyCharges'].astype('float64')
df['TotalCharges'] = df['TotalCharges'].astype('float64')

# Verify the changes
print(df.dtypes)

customerID            object
gender              category
SeniorCitizen          int64
Partner             category
Dependents          category
tenure                 int64
PhoneService        category
MultipleLines       category
InternetService     category
OnlineSecurity      category
OnlineBackup        category
DeviceProtection    category
TechSupport         category
StreamingTV         category
StreamingMovies     category
Contract            category
PaperlessBilling    category
PaymentMethod       category
MonthlyCharges       float64
TotalCharges         float64
Churn               category
dtype: object


Converting data to categorical datatypes can be beneficial for several reasons:

1. **Memory Efficiency**: When a column has only a few distinct values, the categorical dtype usually takes less memory than storing the same values as strings. Each value is stored as a small integer code, with a single lookup table that maps the codes to the category names.

2. **Performance Improvement**: Operations such as comparisons, grouping, and aggregation can be faster on categorical columns because they work on the integer codes rather than on the strings. Some machine learning libraries can also consume categorical columns directly.

3. **Data Integrity**: A categorical column has a fixed set of allowed values, and assigning a value outside that set is rejected. This can prevent errors and inconsistencies such as misspelled labels.

4. **Statistical Analysis**: Categorical data can be useful for statistical analysis, allowing for easier grouping and summarization of data.

5. **Model Interpretability**: In machine learning, categorical features can improve model interpretability by clearly defining distinct groups or categories within the data.

Overall, converting to categorical datatypes helps in optimizing both the storage and processing of data, making it a common practice in data preprocessing.
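
The memory argument can be checked directly with `memory_usage(deep=True)`. The sketch below builds an invented column of 3,000 repeated contract labels; the exact byte counts depend on the pandas version and string backend, so only the comparison is meaningful.

```python
import pandas as pd

# Invented low-cardinality column: three labels repeated 1,000 times each
as_text = pd.Series(["Month-to-month", "One year", "Two year"] * 1000)
as_category = as_text.astype("category")

text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"As strings:  {text_bytes} bytes")
print(f"As category: {category_bytes} bytes")
print(list(as_category.cat.categories))
```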

## 5. **Encode Categorical Variables**

In [None]:
# Use one-hot encoding for categorical variables
df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Display the first few rows of the encoded DataFrame
print(df_encoded.head())

   customerID  SeniorCitizen  tenure  MonthlyCharges  TotalCharges  \
0  7590-VHVEG              0       1           29.85         29.85   
1  5575-GNVDE              0      34           56.95       1889.50   
2  3668-QPYBK              0       2           53.85        108.15   
3  7795-CFOCW              0      45           42.30       1840.75   
4  9237-HQITU              0       2           70.70        151.65   

   gender_Male  Partner_Yes  Dependents_Yes  PhoneService_Yes  \
0        False         True           False             False   
1         True        False           False              True   
2         True        False           False              True   
3         True        False           False             False   
4        False        False           False              True   

   MultipleLines_No phone service  ...  StreamingTV_Yes  \
0                            True  ...            False   
1                           False  ...            False   
2          

Look up whether it makes sense to include this information here and whether it really answers the question about categorical values.

### Conclusions from the Output Data

1. **Customer Tenure and Charges**:
    - The `tenure` of customers varies significantly, ranging from 1 to 45 months in the sample.
    - `MonthlyCharges` and `TotalCharges` also show a wide range, indicating diverse service usage and billing amounts among customers.

2. **Gender Distribution**:
    - The `gender_Male` column indicates the gender of the customers, with both male and female customers represented in the dataset.

3. **Partner and Dependents**:
    - The `Partner_Yes` and `Dependents_Yes` columns show whether customers have partners or dependents. In the sample, some customers have partners, but none have dependents.

4. **Phone and Internet Services**:
    - The `PhoneService_Yes` and `InternetService_Fiber optic` columns indicate whether customers have phone and fiber optic internet services. There is a mix of customers with and without these services.

5. **Streaming Services**:
    - The `StreamingTV_Yes` and `StreamingMovies_Yes` columns show whether customers subscribe to streaming TV and movies services. In the sample, none of the customers have these services.

6. **Contract Types**:
    - The `Contract_One year` and `Contract_Two year` columns indicate the type of contract customers have. The sample includes customers with month-to-month, one-year, and two-year contracts.

7. **Billing and Payment Methods**:
    - The `PaperlessBilling_Yes` column shows whether customers use paperless billing. Some customers use paperless billing, while others do not.
    - The `PaymentMethod_*` columns indicate the payment methods used by customers, including electronic check, mailed check, and credit card.

8. **Churn**:
    - The `Churn_Yes` column indicates whether customers have churned. In the sample, there are both churned and non-churned customers.
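
When reading these encoded columns, note that `drop_first=True` removes the first level of each variable, so that level is represented by all of its dummy columns being `False`. A minimal sketch on an invented `Contract` column:

```python
import pandas as pd

# Invented contract values; category levels are sorted alphabetically
contracts = pd.DataFrame({
    "Contract": pd.Categorical(
        ["Month-to-month", "One year", "Two year", "Month-to-month"]
    )
})

encoded = pd.get_dummies(contracts, columns=["Contract"], drop_first=True)

# 'Month-to-month' has no column of its own: it is the baseline level
print(encoded.columns.tolist())
print(encoded)
```

The same logic applies to `gender_Male` (baseline `Female`) and to every `*_Yes` column (baseline `No`).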

### Noteworthy Observations

- **Diverse Customer Profiles**: The dataset includes a diverse range of customer profiles in terms of tenure, charges, services subscribed, and contract types.
- **Churn Analysis**: The presence of both churned and non-churned customers allows for analysis of factors contributing to customer churn.
- **Service Usage**: The variability in service usage (e.g., phone, internet, streaming) can provide insights into customer preferences and potential areas for service improvement.
- **Billing and Payment Preferences**: Understanding billing and payment preferences can help tailor customer service and billing processes to improve customer satisfaction and retention.


## 6. **Feature Engineering**

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

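# Create a new feature for tenure groups
# include_lowest=True keeps tenure = 0 in the first bin; by default the bins exclude their left edge, so 0 would become NaN
df['tenure_group'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 60, np.inf], labels=['0-12', '12-24', '24-48', '48-60', '60+'], include_lowest=True)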

# Create a new feature for total charges per month
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')  # Ensure TotalCharges is numeric
df['TotalChargesPerMonth'] = df['TotalCharges'] / df['tenure']
# Division by tenure = 0 yields inf (or NaN); set those rows to 0. The result is assigned back
# instead of using chained inplace=True, which will stop working in pandas 3.0
df['TotalChargesPerMonth'] = df['TotalChargesPerMonth'].replace([np.inf, -np.inf], np.nan).fillna(0)

# Convert SeniorCitizen to a categorical feature
df['SeniorCitizen'] = df['SeniorCitizen'].astype('category')

# Normalize numerical features
scaler = StandardScaler()
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges', 'TotalChargesPerMonth']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# Verify the changes
print(df.head())

   customerID  gender SeniorCitizen Partner Dependents    tenure PhoneService  \
0  7590-VHVEG  Female             0     Yes         No -1.277445           No   
1  5575-GNVDE    Male             0      No         No  0.066327          Yes   
2  3668-QPYBK    Male             0      No         No -1.236724          Yes   
3  7795-CFOCW    Male             0      No         No  0.514251           No   
4  9237-HQITU  Female             0      No         No -1.236724          Yes   

      MultipleLines InternetService OnlineSecurity  ... StreamingTV  \
0  No phone service             DSL             No  ...          No   
1                No             DSL            Yes  ...          No   
2                No             DSL            Yes  ...          No   
3  No phone service             DSL            Yes  ...          No   
4                No     Fiber optic             No  ...          No   

  StreamingMovies        Contract PaperlessBilling              PaymentMethod  \
0    


Check that the code uses a method consistent with pandas 3.0 (assigning the result back instead of chained `inplace=True`), then rewrite this text.


### Additional Features Created

1. **tenure_group**:
    - **Description**: This feature categorizes the `tenure` of customers into groups: '0-12', '12-24', '24-48', '48-60', and '60+' months.
    - **Reason for Creation**: Grouping the tenure into categories helps in understanding the distribution of customer loyalty over time. It can reveal patterns in customer retention and identify which tenure groups are more likely to churn.

2. **TotalChargesPerMonth**:
    - **Description**: This feature is calculated by dividing the `TotalCharges` by the `tenure`, representing the average charges per month for each customer.
    - **Reason for Creation**: This feature normalizes the total charges over the tenure period, providing a clearer picture of the monthly expenditure of customers. It helps in identifying customers who might be paying more or less on average per month, which can be crucial for understanding customer satisfaction and predicting churn.

### Importance for Analysis/Prediction

- **tenure_group**: By categorizing the tenure, we can perform more granular analysis on customer retention and churn rates across different tenure groups. This can help in identifying specific periods where customers are more likely to churn, allowing for targeted retention strategies.

- **TotalChargesPerMonth**: This feature provides insight into the average monthly spending of customers, which can be a significant factor in predicting churn. Customers with higher average monthly charges might have different satisfaction levels or service usage patterns compared to those with lower charges. This feature can help in segmenting customers based on their spending behavior and tailoring marketing or retention efforts accordingly.

These additional features enhance the dataset by providing more detailed and actionable insights, which are essential for effective analysis and predictive modeling.
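
One detail of the tenure grouping deserves a quick check: `pd.cut` bins exclude their left edge by default, so a tenure of exactly 0 falls outside the first bin unless `include_lowest=True` is passed. The sketch below compares both behaviours on a few invented tenure values.

```python
import numpy as np
import pandas as pd

# Invented tenure values, including the edge case 0
tenure = pd.Series([0, 1, 12, 13, 72])
bins = [0, 12, 24, 48, 60, np.inf]
labels = ["0-12", "12-24", "24-48", "48-60", "60+"]

# Default: intervals are (0, 12], (12, 24], ... so 0 is not assigned to any group
default_groups = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 12]
inclusive_groups = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(default_groups.tolist())
print(inclusive_groups.tolist())
```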


## 7. **Handle Outliers**

In [None]:
# Cap outliers at the IQR fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR)
def cap_outliers(df, column):
    """Clip df[column] in place to the IQR fences."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)

# Apply the capping function to numerical features
for col in numerical_features:
    cap_outliers(df, col)

# Verify the changes
print(df[numerical_features].describe())

             tenure  MonthlyCharges  TotalCharges  TotalChargesPerMonth
count  7.043000e+03    7.043000e+03  7.043000e+03          7.043000e+03
mean  -2.421273e-17   -6.406285e-17 -1.488074e-17          7.465592e-17
std    1.000071e+00    1.000071e+00  1.000071e+00          1.000071e+00
min   -1.318165e+00   -1.545860e+00 -9.991203e-01         -2.137475e+00
25%   -9.516817e-01   -9.725399e-01 -8.298459e-01         -9.597171e-01
50%   -1.372744e-01    1.857327e-01 -3.904632e-01          1.850696e-01
75%    9.214551e-01    8.338335e-01  6.642871e-01          8.416645e-01
max    1.613701e+00    1.794352e+00  2.826743e+00          1.873292e+00



Should we compare this with the similar `describe()` table above? The two tables are identical.

### Handling Outliers

The outlier problem in the dataset was addressed using the Interquartile Range (IQR) method. Here's how the solution works:

1. **Identification of Outliers**:
    - For each numerical feature, the first quartile (Q1) and third quartile (Q3) were calculated.
    - The IQR was computed as the difference between Q3 and Q1.
    - Lower and upper bounds were determined using the formula:
      $$
      \text{Lower Bound} = Q1 - 1.5 \times \text{IQR}
      $$
      $$
      \text{Upper Bound} = Q3 + 1.5 \times \text{IQR}
      $$

2. **Capping Outliers**:
    - Values below the lower bound were capped at the lower bound.
    - Values above the upper bound were capped at the upper bound.

3. **Standardization**:
    - The numerical features had already been standardized with the `StandardScaler` from scikit-learn in the feature engineering step, before capping, so the IQR bounds were computed on the scaled values. This transformation rescales each feature to a mean of 0 and a standard deviation of 1.

### Explanation

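- **Standardization**: The means are effectively zero (on the order of 1e-17) and the standard deviations are 1.000071. The small excess over 1 arises because `describe()` reports the sample standard deviation while `StandardScaler` divides by the population standard deviation.
- **Capping**: For every feature, the minimum and maximum already lie inside the IQR fences implied by the quartiles in the table. For example, the upper fence for `TotalCharges` is about 0.66 + 1.5 × 1.49 ≈ 2.91, above the maximum of 2.83. Capping therefore left the values unchanged, which is also why the standard deviations are still 1.000071.
- **Distribution**: The quartiles (25%, 50%, 75%) lie between about -1 and 1 for every feature. The `TotalCharges` median of -0.39 sits below the mean of 0, which indicates a right-skewed distribution.

By standardizing the data and capping at the IQR fences, we ensure that the numerical features are on a similar scale and that extreme values, should any occur, do not disproportionately influence the analysis or predictive models.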
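
To see whether capping changes anything, it helps to count the values outside the fences before clipping. The sketch below is a non-mutating variant of the IQR approach; the helper name `iqr_bounds` and the sample values are assumptions made for illustration.

```python
import pandas as pd

def iqr_bounds(values):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Invented sample with one extreme value
values = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 100.0])

lower, upper = iqr_bounds(values)
n_outside = int(((values < lower) | (values > upper)).sum())

# clip returns a new Series and leaves the input untouched
capped = values.clip(lower=lower, upper=upper)

print(f"Fences: [{lower}, {upper}], values outside: {n_outside}")
print(capped.tolist())
```

A count of zero for a column means that capping is a no-op for it.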


## 8. **Split the Data**

In [None]:
from sklearn.model_selection import train_test_split

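# Re-encode the current DataFrame: df_encoded from step 5 predates the engineered,
# scaled, and capped columns created in steps 6 and 7
df_encoded = pd.get_dummies(df, columns=categorical_columns + ['SeniorCitizen', 'tenure_group'], drop_first=True)

# Define the features and target variable
X = df_encoded.drop(columns=['customerID', 'Churn_Yes'])
y = df_encoded['Churn_Yes']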

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

The last cell splits the dataset into training and testing sets.

The dataset is split using a fixed random state to ensure reproducibility. The test size is set to 0.2, meaning 20% of the data will be used for testing, and the remaining 80% will be used for training. This split ratio is commonly used to provide a balance between having enough data to train the model and having sufficient data to evaluate its performance.

**Parameters**:
- X (array-like): Features dataset.
- y (array-like): Target variable.

**Returns**:
- X_train (array-like): Training features.
- X_test (array-like): Testing features.
- y_train (array-like): Training target variable.
- y_test (array-like): Testing target variable.
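
A possible refinement, which is not what the notebook currently does, is to stratify the split on the target and to fit the scaler on the training portion only, so that no statistics from the test set influence preprocessing. The sketch below illustrates this ordering on invented data; `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented features and a 25% positive target, standing in for X and y
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=100).astype(float),
    "MonthlyCharges": rng.uniform(18, 119, size=100),
})
y_toy = pd.Series([1] * 25 + [0] * 75)

# stratify keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit on the training data only, then apply the same transformation to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(f"Positive share, train: {y_tr.mean():.2f}, test: {y_te.mean():.2f}")
```

With this ordering the test set's scaled means will generally not be exactly zero, which is expected.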
