# Step 4: Data Type Conversion

## Preprocessing Pipeline Overview

This preprocessing pipeline outlines the steps necessary to prepare the Telco Customer Churn dataset for modeling. Each step addresses a specific aspect of data quality, transformation, or feature creation. We cover each step in a separate Jupyter notebook.

**Step 1: Data Loading**: Loading the datasets into the workspace, ensuring all necessary files are correctly imported for analysis. This includes the Kaggle dataset and the IBM datasets.

**Step 2: Dataset Integration**: Combining relevant datasets into a single, unified dataset that will serve as the foundation for subsequent analysis.

**Step 3: Handling Missing Values**: Identifying and addressing missing values in the dataset to ensure data integrity. This step ensures no significant gaps hinder the analysis.

**Step 4: Data Type Conversion**: Converting data columns to appropriate data types to optimize memory usage and prepare for feature engineering. This step also ensures that types are consistent across all columns.

**Step 5: Data Exploration**: Performing initial exploratory data analysis (EDA) to understand the dataset's structure and characteristics, and visualizing key features to gain insights into the data.

**Step 6: Feature Engineering**: Creating new features from the existing data to enhance model performance and capture additional insights. This includes transformations and derived features.

**Step 7: Outlier Detection**: Identifying and addressing outliers in the dataset to ensure they do not negatively impact the analysis or models.

**Step 8: Clustering Customers**: Identifying the most common customer profiles via clustering.

**Step 9: Dataset Splitting**: Splitting the dataset into training and testing subsets to prepare for model development and evaluation. This step ensures reproducibility and robust performance metrics.

## 4.1 Inspecting Data Types


Before proceeding with data type conversion, let's first inspect the data types of the columns in the `telcocustomerchurn_combined.csv` dataset. This will help us identify which columns need to be converted to appropriate data types.

In [101]:
# Load the dataset
import pandas as pd

# Assuming the dataset is in the sibling '2_data' directory
df = pd.read_csv('../2_data/telcocustomerchurn_combined.csv')

# Set display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

print(df.dtypes)

Unnamed: 0                             int64
Customer ID                           object
Count                                  int64
Gender                                object
Age                                    int64
Under 30                              object
Senior Citizen                        object
Married                               object
Dependents                            object
Number of Dependents                   int64
Location ID                           object
Country                               object
State                                 object
City                                  object
Zip Code                               int64
Lat Long                              object
Latitude                             float64
Longitude                            float64
Service ID                            object
Quarter                               object
Referred a Friend                     object
Number of Referrals                    int64
Tenure in 

The following reference describes every variable in the source tables. We compiled it by analyzing `telcocustomerchurn.csv`:

### **Meaning of All the Variables**
#### Churn Predict
- Loyalty ID: A unique identifier for each customer's loyalty program membership.
- Customer ID: A unique identifier for each customer.
- Senior Citizen: Indicates if the customer is a senior citizen (Yes/No).
- Partner: Indicates if the customer has a partner or spouse (Yes/No).
- Dependents: Indicates if the customer has dependent family members (Yes/No).
- Tenure: The number of months the customer has stayed with the company.
- Phone Service: Indicates if the customer has phone service (Yes/No).
- Multiple Lines: Indicates if the customer has multiple phone lines (Yes/No).
- Internet Service: Type of internet service the customer has (DSL, Fiber optic, No).
- Online Security: Indicates if the customer has subscribed to online security services (Yes/No).
- Online Backup: Indicates if the customer has subscribed to online backup services (Yes/No).
- Device Protection: Indicates if the customer has subscribed to device protection services (Yes/No).
- Tech Support: Indicates if the customer has subscribed to technical support services (Yes/No).
- Streaming TV: Indicates if the customer has subscribed to streaming TV services (Yes/No).
- Streaming Movies: Indicates if the customer has subscribed to streaming movie services (Yes/No).
- Contract: Type of contract the customer has (Month-to-month, One year, Two year).
- Paperless Billing: Indicates if the customer has opted for paperless billing (Yes/No).
- Payment Method: The payment method used by the customer (Bank transfer, Credit card, Electronic check, Mailed check).
- Monthly Charges: The amount charged to the customer every month.
- Total Charges: The total amount charged to the customer over the entire tenure.
- Churn: Indicates if the customer has discontinued the service (Yes/No).

#### Demographics
- CustomerID: A unique ID that identifies each customer.
- Count: A value used in reporting/dashboarding to sum up the number of customers in a filtered set. ~~(All 1, could be deleted directly)~~
- Gender: The customer’s gender (Male/Female)
- Age: The customer’s current age, in years, at the time the fiscal quarter ended.
- Senior Citizen: Indicates if the customer is 65 or older (Yes/No)
- Married: Indicates if the customer is married (Yes/No)
- Dependents: Indicates if the customer lives with any dependents: (Yes/No).
- Number of Dependents: Indicates the number of dependents that live with the customer.

#### Location (precision?)

- Location ID: A unique ID that identifies each location.
- Customer ID: A unique ID that identifies each customer.
- Count: A value used in reporting/dashboarding to sum up the number of customers in a filtered set.
- Country: The country of the customer’s primary residence.~~(All US, could delete directly)~~
- State: The state of the customer’s primary residence.~~(All California, could be deleted directly)~~
- City: The city of the customer’s primary residence.
- Zip Code: The zip code of the customer’s primary residence.~~(strongly correlated to city?)~~
- Lat Long: The combined latitude and longitude of the customer’s primary residence.
- Latitude: The latitude of the customer’s primary residence.
- Longitude: The longitude of the customer’s primary residence.


#### Population: the population of each zip code area
- ID: A unique ID that identifies each row.
- Zip Code: The zip code of the customer’s primary residence.
- Population: A current population estimate for the entire Zip Code area.

#### Services
- Service ID: A unique ID that identifies each type of service a customer subscribes to.
- CustomerID: A unique ID that identifies each customer.
- Count: A value used in reporting/dashboarding to sum up the number of customers in a filtered set. (All 1)
- Quarter: The fiscal quarter that the data has been derived from (All Q3).
- Referred a Friend: Indicates if the customer has ever referred a friend or family member to this company (Yes/No)
- Number of Referrals: Indicates the number of referrals to date that the customer has made.
- Tenure in Months: Indicates the total amount of months that the customer has been with the company by the end of the quarter specified above.
- Offer: Identifies the last marketing offer that the customer accepted, if applicable (Values include None/ Offer A/ Offer B/Offer C/Offer D/Offer E)
- Phone Service: Indicates if the customer subscribes to home phone service with the company (Yes/No)
- Avg Monthly Long Distance Charges: Indicates the customer’s average long distance charges, calculated to the end of the quarter specified above.
- Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company (Yes/No)
- Internet Service: Indicates if the customer subscribes to Internet service with the company (Yes/No)
- Internet Type: Type of internet service the customer has (Fiber Optic/Cable/DSL/None)
- Avg Monthly GB Download: Indicates the customer’s average download volume in gigabytes, calculated to the end of the quarter specified above.
- Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: (Yes/No)
- Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: (Yes/No)
- Device Protection Plan: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company (Yes/No)
- Premium Tech Support: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times (Yes/No)
- Streaming TV: Indicates if the customer uses their Internet service to stream television programing from a third party provider: (Yes/No)
- Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: (Yes/No)
- Streaming Music: Indicates if the customer uses their Internet service to stream music from a third party provider: (Yes/No) 
- Unlimited Data: Indicates if the customer has paid an additional monthly fee to have unlimited data downloads/uploads: (Yes/No)
- Contract: Indicates the customer’s current contract type: (Month-to-Month/One Year/Two Year)
- Paperless Billing: Indicates if the customer has chosen paperless billing: (Yes/No)
- Payment Method: Indicates how the customer pays their bill: (Bank Withdrawal/Credit Card/Mailed Check)
- Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.
- Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.
- Total Refunds: Indicates the customer’s total refunds, calculated to the end of the quarter specified above.
- Total Extra Data Charges: Indicates the customer’s total charges for extra data downloads above those specified in their plan, by the end of the quarter specified above.
- Total Long Distance Charges: Indicates the customer’s total charges for long distance above those specified in their plan, by the end of the quarter specified above.
- Total Revenue: `Total Revenue` = `Total Charges` - `Total Refunds` + `Total Extra Data Charges` + `Total Long Distance Charges`
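
The revenue identity above can be spot-checked in pandas. The following is a minimal sketch that assumes the column names from the Services table; the toy values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the Services table.
toy = pd.DataFrame({
    "Total Charges": [593.30, 542.40],
    "Total Refunds": [0.00, 38.33],
    "Total Extra Data Charges": [0, 10],
    "Total Long Distance Charges": [381.51, 96.21],
})

# Recompute revenue from its four components.
toy["Total Revenue (recomputed)"] = (
    toy["Total Charges"]
    - toy["Total Refunds"]
    + toy["Total Extra Data Charges"]
    + toy["Total Long Distance Charges"]
)
print(toy)
```

On the real data, comparing the recomputed column with the stored `Total Revenue` (using a small tolerance for floating-point rounding) would show whether the stored column is fully derivable from the other four.
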
#### Status
- Status ID: A unique ID that identifies each customer's status.
- CustomerID: A unique ID that identifies each customer.
- Count: A value used in reporting/dashboarding to sum up the number of customers in a filtered set. (All 1)
- Quarter: The fiscal quarter that the data has been derived from (All Q3).
- Satisfaction Score: A customer’s overall satisfaction rating of the company from 1 (Very Unsatisfied) to 5 (Very Satisfied).
- Customer Status: Indicates the status of the customer at the end of the quarter: (Churned/Stayed/Joined)
- Churn Label: Yes = the customer left the company this quarter. No = the customer remained with the company. Directly related to Churn Value.(Yes/No)
- Churn Value: 1 = the customer left the company this quarter. 0 = the customer remained with the company. Directly related to Churn Label.(1/0)
- Churn Score: A value from 0-100 that is calculated using the predictive tool IBM SPSS Modeler. The model incorporates multiple factors known to cause churn. The higher the score, the more likely the customer will churn. (1-100)
- Churn Score Category: A calculation that assigns a Churn Score to one of the following categories: 0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91-100 (numerical to nominal, equal-interval binning).
- CLTV: Customer Lifetime Value. A predicted CLTV is calculated using corporate formulas and existing data. The higher the value, the more valuable the customer. High value customers should be monitored for churn.
- CLTV Category: A calculation that assigns a CLTV value to one of the following categories: 2000-2500, 2501-3000, 3001-3500, 3501-4000, 4001-4500, 4501-5000, 5001-5500, 5501-6000, 6001-6500, and 6501-7000.
- Churn Category: A high-level category for the customer’s reason for churning: Attitude, Competitor, Dissatisfaction, Other, Price. When they leave the company, all customers are asked about their reasons for leaving. Directly related to Churn Reason. (Attitude/Competitor/Dissatisfaction/Price/Other)
- Churn Reason: A customer’s specific reason for leaving the company. Directly related to Churn Category.
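
The equal-interval binning described for `Churn Score Category` can be reproduced with `pd.cut`. This is a sketch under the assumption that scores are integers between 0 and 100; the sample scores are illustrative only.

```python
import pandas as pd

# Illustrative scores, not taken from the dataset.
scores = pd.Series([0, 10, 11, 55, 100], name="Churn Score")

# Bin edges 0, 10, ..., 100 and the labels listed in the variable description.
edges = list(range(0, 101, 10))
labels = ["0-10"] + [f"{lo + 1}-{lo + 10}" for lo in range(10, 100, 10)]

# include_lowest=True places a score of 0 in the first bin.
score_category = pd.cut(scores, bins=edges, labels=labels, include_lowest=True)
print(score_category.tolist())
```
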

In [102]:
for col in df.select_dtypes(include=['object']).columns:
    if df[col].nunique() < 50:  
        df[col] = df[col].astype('category')

for col in df.columns:
    print(f"{col}: {df[col].dtype}")

Unnamed: 0: int64
Customer ID: object
Count: int64
Gender: category
Age: int64
Under 30: category
Senior Citizen: category
Married: category
Dependents: category
Number of Dependents: int64
Location ID: object
Country: category
State: category
City: object
Zip Code: int64
Lat Long: object
Latitude: float64
Longitude: float64
Service ID: object
Quarter: category
Referred a Friend: category
Number of Referrals: int64
Tenure in Months: int64
Offer: category
Phone Service: category
Avg Monthly Long Distance Charges: float64
Multiple Lines: category
Internet Service: category
Internet Type: category
Avg Monthly GB Download: int64
Online Security: category
Online Backup: category
Device Protection Plan: category
Premium Tech Support: category
Streaming TV: category
Streaming Movies: category
Streaming Music: category
Unlimited Data: category
Contract: category
Paperless Billing: category
Payment Method: category
Monthly Charge: float64
Total Charges: float64
Total Refunds: float64
Total Extr
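
To illustrate why low-cardinality `object` columns are converted to `category`, the sketch below compares the memory footprint of both representations. It uses a synthetic column with the three `Contract` values rather than the real dataset.

```python
import pandas as pd

# Synthetic low-cardinality column stored as Python strings (object dtype).
contract = pd.Series(
    ["Month-to-Month", "One Year", "Two Year"] * 1000, dtype="object"
)

# deep=True also counts the memory used by the string objects themselves.
as_object_bytes = contract.memory_usage(deep=True)
as_category_bytes = contract.astype("category").memory_usage(deep=True)

print(f"object:   {as_object_bytes} bytes")
print(f"category: {as_category_bytes} bytes")
```

A categorical column stores each distinct label once and keeps small integer codes per row, which is why the saving grows with the number of rows.
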

In [103]:
# Collect column names by dtype
category_columns = []
int_columns = []
float_columns = []
object_columns = []

for col in df.columns:
    dtype = df[col].dtype
    if dtype == 'object':
        object_columns.append(col)
    elif dtype == 'int64':
        int_columns.append(col)
    elif dtype == 'float64':
        float_columns.append(col)
    else:
        category_columns.append(col)  

print("Integer Columns (int64):")
for col in int_columns:
    print(f" - {col}")

print("\nFloat Columns (float64):")
for col in float_columns:
    print(f" - {col}")

print("\nObject Columns (object):")
for col in object_columns:
    print(f" - {col}")

print("\nCategory Columns (category):")
for col in category_columns:
    print(f" - {col}")

Integer Columns (int64):
 - Unnamed: 0
 - Count
 - Age
 - Number of Dependents
 - Zip Code
 - Number of Referrals
 - Tenure in Months
 - Avg Monthly GB Download
 - Total Extra Data Charges
 - Satisfaction Score
 - Churn Value
 - Churn Score
 - CLTV
 - LoyaltyID
 - Tenure

Float Columns (float64):
 - Latitude
 - Longitude
 - Avg Monthly Long Distance Charges
 - Monthly Charge
 - Total Charges
 - Total Refunds
 - Total Long Distance Charges
 - Total Revenue
 - Monthly Charges

Object Columns (object):
 - Customer ID
 - Location ID
 - City
 - Lat Long
 - Service ID
 - Status ID

Category Columns (category):
 - Gender
 - Under 30
 - Senior Citizen
 - Married
 - Dependents
 - Country
 - State
 - Quarter
 - Referred a Friend
 - Offer
 - Phone Service
 - Multiple Lines
 - Internet Service
 - Internet Type
 - Online Security
 - Online Backup
 - Device Protection Plan
 - Premium Tech Support
 - Streaming TV
 - Streaming Movies
 - Streaming Music
 - Unlimited Data
 - Contract
 - Paperless Billin

### Data Type Conversion Explanation

The columns fall into four dtype groups. Only the columns that were loaded as `object` need real conversion work, for the following reasons:

**Integer Columns (int64)**
- **Description**: Integer columns contain whole numbers.
- **Conversion**: Integer data types are straightforward and typically do not require complex conversion logic. They can be easily cast to other numeric types if needed.

**Float Columns (float64)**
- **Description**: Float columns contain decimal numbers.
- **Conversion**: Similar to integers, float data types are also straightforward. They can be cast to integers or other numeric types with simple type casting.

**Object Columns (object)**
- **Description**: Object columns can contain mixed data types, including strings, dates, and more complex structures.
- **Conversion**: Object data types require specific handling because they can hold complex and nested structures. Proper conversion ensures that the integrity of the data is maintained, especially when dealing with strings, dates, or custom objects.

**Category Columns (category)**
- **Description**: Category columns contain categorical data, which is often represented as strings but stored more efficiently.
- **Conversion**: While category data types are efficient, they are typically derived from object data types. The conversion logic for categories is usually simpler once the object data types are correctly handled.

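The following steps therefore focus on the `category` columns, all of which were `object` columns before the conversion above. The remaining `object` columns are handled at the end of this notebook.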
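
The same grouping by dtype can be obtained with `DataFrame.select_dtypes`. The sketch below uses a small hypothetical frame (`demo`) whose column names mirror the dataset; the numeric values are illustrative.

```python
import pandas as pd

# Hypothetical four-column frame, one column per dtype group.
demo = pd.DataFrame({
    "Age": pd.Series([37, 46], dtype="int64"),
    "Monthly Charge": pd.Series([65.6, 59.9], dtype="float64"),
    "Customer ID": pd.Series(["0002-ORFBO", "0003-MKNFE"], dtype="object"),
    "Gender": pd.Categorical(["Female", "Male"]),
})

# One list of column names per dtype group.
columns_by_dtype = {
    dtype_name: demo.select_dtypes(include=[dtype_name]).columns.tolist()
    for dtype_name in ("int64", "float64", "object", "category")
}

for dtype_name, cols in columns_by_dtype.items():
    print(f"{dtype_name}: {cols}")
```
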

In [104]:
for col in category_columns:
    unique_values = df[col].unique()
    num_unique_values = len(unique_values)
    
    if num_unique_values == df.shape[0]:
        print(f"Column '{col}' has all unique values.")
    else:
        print(f"Column '{col}' has {num_unique_values} unique values.")
        if num_unique_values <= 50:  # Display all unique values when there are 50 or fewer
            print(f"Unique values in '{col}': {unique_values}")
        else:
            print(f"Sample unique values in '{col}': {unique_values[:10]}")

Column 'Gender' has 2 unique values.
Unique values in 'Gender': ['Female', 'Male']
Categories (2, object): ['Female', 'Male']
Column 'Under 30' has 2 unique values.
Unique values in 'Under 30': ['No', 'Yes']
Categories (2, object): ['No', 'Yes']
Column 'Senior Citizen' has 2 unique values.
Unique values in 'Senior Citizen': ['No', 'Yes']
Categories (2, object): ['No', 'Yes']
Column 'Married' has 2 unique values.
Unique values in 'Married': ['Yes', 'No']
Categories (2, object): ['No', 'Yes']
Column 'Dependents' has 2 unique values.
Unique values in 'Dependents': ['No', 'Yes']
Categories (2, object): ['No', 'Yes']
Column 'Country' has 1 unique values.
Unique values in 'Country': ['United States']
Categories (1, object): ['United States']
Column 'State' has 1 unique values.
Unique values in 'State': ['California']
Categories (1, object): ['California']
Column 'Quarter' has 1 unique values.
Unique values in 'Quarter': ['Q3']
Categories (1, object): ['Q3']
Column 'Referred a Friend' has 2 u

### Summary of Data Type Conversion for Object Attributes

Based on the analysis of the object attributes in the dataset, here is a summary of the proposed data type conversions, grouped by the target data type:

#### Convert to Boolean
Columns with exactly two unique values ('No'/'Yes', or 'Female'/'Male' for Gender):
Gender, Under 30, Senior Citizen, Married, Dependents, Referred a Friend, Phone Service, Multiple Lines, Internet Service, Online Security, Online Backup, Device Protection Plan, Premium Tech Support, Streaming TV, Streaming Movies, Streaming Music, Unlimited Data, Paperless Billing, Churn, Partner

#### Convert to Category
1. **Country**: (1 unique value: 'United States')
2. **State**: (1 unique value: 'California')
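3. **City**: (1106 unique values; it exceeds the 50-value threshold used above, so it stays `object` here and is label-encoded later in this notebook)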
4. **Quarter**: (1 unique value: 'Q3')
5. **Offer**: (6 unique values: nan, 'Offer E', 'Offer D', 'Offer A', 'Offer B', 'Offer C')
6. **Internet Type**: (4 unique values: 'Cable', 'Fiber Optic', 'DSL', nan)
7. **Contract**: (3 unique values: 'One Year', 'Month-to-Month', 'Two Year')
8. **Payment Method**: (3 unique values: 'Credit Card', 'Bank Withdrawal', 'Mailed Check')
9. **Customer Status**: (3 unique values: 'Stayed', 'Churned', 'Joined')
10. **Churn Category**: (6 unique values: nan, 'Competitor', 'Dissatisfaction', 'Other', 'Price', 'Attitude')
11. **Churn Reason**: (21 unique values)
12. **Device Protection**: (3 unique values: 'No', 'Yes', 'No internet service')
13. **Tech Support**: (3 unique values: 'Yes', 'No', 'No internet service')

In [105]:
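from sklearn.preprocessing import LabelEncoder

# Columns with exactly two distinct non-null values (nunique() ignores NaN).
# Every column is scanned, so already-numeric binaries such as 'Churn Value' are included.
yes_no_columns = [col for col in df.columns if df[col].nunique() == 2]
print(yes_no_columns)

# Category columns with any other number of distinct values
remaining_category_columns = [col for col in category_columns if col not in yes_no_columns]
print(remaining_category_columns)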

['Gender', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Referred a Friend', 'Phone Service', 'Multiple Lines', 'Internet Service', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing', 'Churn Value', 'Partner', 'Churn']
['Country', 'State', 'Quarter', 'Offer', 'Internet Type', 'Contract', 'Payment Method', 'Customer Status', 'Churn Category', 'Churn Reason', 'Device Protection', 'Tech Support']


## 4.2 Encode Categorical Variables

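We will use `LabelEncoder` from the `sklearn.preprocessing` module to transform categorical columns with only two possible values (e.g., "Yes" or "No") into numerical values. This process is known as label encoding. `LabelEncoder` assigns integer codes in sorted order of the labels, so "No" becomes 0 and "Yes" becomes 1; for `Gender`, "Female" becomes 0 and "Male" becomes 1.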

### Reasons for Transforming Categorical Values:
1. **Machine Learning Compatibility**: Many machine learning algorithms require numerical input. By converting categorical values into numerical values, we ensure that the data can be used with a wide range of machine learning models.
2. **Simplification**: Binary categorical values (e.g., "Yes" or "No") are straightforward to encode. Label encoding transforms these values into 0 and 1, simplifying the data representation.
3. **Efficiency**: Numerical values are more efficient to process and store than strings. This can lead to performance improvements during model training and prediction.
4. **Consistency**: Using a consistent method like `LabelEncoder` ensures that the transformation is applied uniformly across the dataset, reducing the risk of errors or inconsistencies.
5. **Interpretability**: The transformation from "Yes" to 1 and "No" to 0 is intuitive and easy to understand, making the data more interpretable for analysis and debugging.


For each column in the list of binary categorical columns (`yes_no_columns`), the `fit_transform` method of `LabelEncoder` is applied to convert the categorical values into numerical values.

**Result**: The transformed columns in the DataFrame now contain numerical values (0 and 1) instead of the original categorical values.

In [106]:
# LabelEncoder
le = LabelEncoder()

# Label Encoding
for col in yes_no_columns:
    df[col] = le.fit_transform(df[col])

print(df[yes_no_columns].head())

   Gender  Under 30  Senior Citizen  Married  Dependents  Referred a Friend  \
0       0         0               0        1           0                  1   
1       1         0               0        0           0                  0   
2       1         0               0        0           0                  0   
3       1         0               1        1           0                  1   
4       0         0               1        1           0                  1   

   Phone Service  Multiple Lines  Internet Service  Online Security  \
0              1               0                 1                0   
1              1               1                 1                0   
2              1               0                 1                0   
3              1               0                 1                0   
4              1               0                 1                0   

   Online Backup  Device Protection Plan  Premium Tech Support  Streaming TV  \
0              1  
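
Because `LabelEncoder` derives the codes from the sorted labels, the meaning of 0 and 1 is implicit. As an alternative sketch (not the method used in this notebook), an explicit mapping makes the encoding visible and fails loudly on unexpected labels. The frame `toy` and the dictionary `binary_maps` are hypothetical names introduced for this example.

```python
import pandas as pd

# Hypothetical toy frame with two binary columns.
toy = pd.DataFrame({
    "Married": ["Yes", "No", "No"],
    "Gender": ["Female", "Male", "Male"],
})

# Explicit label-to-code mappings, one per column.
binary_maps = {
    "Married": {"No": 0, "Yes": 1},
    "Gender": {"Female": 0, "Male": 1},
}

for col, mapping in binary_maps.items():
    # Refuse to encode labels that the mapping does not cover.
    unexpected = set(toy[col].dropna()) - set(mapping)
    if unexpected:
        raise ValueError(f"Unexpected labels in '{col}': {unexpected}")
    toy[col] = toy[col].map(mapping).astype("int64")

print(toy)
```
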

### One-Hot Encoding for Remaining Category Columns

For categorical variables with more than two unique values, we will use one-hot encoding. One-hot encoding transforms each unique value in a categorical column into a separate binary column. This process is essential for machine learning algorithms that require numerical input.

**Columns to be One-Hot Encoded**:
1. **Country**: (1 unique value: 'United States')
2. **State**: (1 unique value: 'California')
3. **Quarter**: (1 unique value: 'Q3')
4. **Offer**: (6 unique values: nan, 'Offer E', 'Offer D', 'Offer A', 'Offer B', 'Offer C')
5. **Internet Type**: (4 unique values: 'Cable', 'Fiber Optic', 'DSL', nan)
6. **Contract**: (3 unique values: 'One Year', 'Month-to-Month', 'Two Year')
7. **Payment Method**: (3 unique values: 'Credit Card', 'Bank Withdrawal', 'Mailed Check')
8. **Customer Status**: (3 unique values: 'Stayed', 'Churned', 'Joined')
9. **Churn Category**: (6 unique values: nan, 'Competitor', 'Dissatisfaction', 'Other', 'Price', 'Attitude')
10. **Churn Reason**: (21 unique values)
11. **Device Protection**: (3 unique values: 'No', 'Yes', 'No internet service')
12. **Tech Support**: (3 unique values: 'Yes', 'No', 'No internet service')

### Reasons for One-Hot Encoding:
1. **Machine Learning Compatibility**: Many machine learning algorithms require numerical input. One-hot encoding ensures that categorical data can be used with these algorithms.
2. **Avoiding Ordinal Relationships**: One-hot encoding prevents the algorithm from assuming any ordinal relationship between the categories, which is crucial for non-ordinal categorical data.
3. **Improved Performance**: One-hot encoding can improve the performance of machine learning models by providing a clear and unambiguous representation of categorical data.

After applying one-hot encoding to the specified columns, the categorical variables will be transformed into multiple binary columns. Each unique value in a categorical column will become a separate binary column, where a value of 1 indicates the presence of that category and 0 indicates its absence. This is feasible here because every column to be encoded has at most 21 unique values, so the increase in dimensionality stays manageable. Note that `Country`, `State`, and `Quarter` each have a single value, so their one-hot columns are constant and carry no information.

In [107]:
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=remaining_category_columns)

# Identify the columns that one-hot encoding added
one_hot_encoded_columns = df_encoded.columns.difference(df.columns)

# Convert boolean values to integers (0 and 1)
df_encoded[one_hot_encoded_columns] = df_encoded[one_hot_encoded_columns].astype(int)

# Print only the one-hot encoded columns
print(df_encoded[one_hot_encoded_columns].head())

   Churn Category_Attitude  Churn Category_Competitor  \
0                        0                          0   
1                        0                          0   
2                        0                          1   
3                        0                          0   
4                        0                          0   

   Churn Category_Dissatisfaction  Churn Category_Other  Churn Category_Price  \
0                               0                     0                     0   
1                               0                     0                     0   
2                               0                     0                     0   
3                               1                     0                     0   
4                               1                     0                     0   

   Churn Reason_Attitude of service provider  \
0                                          0   
1                                          0   
2                         
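
By default, `pd.get_dummies` creates no column for missing values, so rows where `Offer`, `Internet Type`, or `Churn Category` is NaN receive 0 in every generated column of that variable. The sketch below shows this on a hypothetical three-row frame and contrasts it with `dummy_na=True`, which adds an explicit indicator column.

```python
import pandas as pd

# Hypothetical frame with one missing value.
toy = pd.DataFrame({"Offer": ["Offer A", None, "Offer B"]})

# Default behaviour: the missing row is all zeros.
default_dummies = pd.get_dummies(toy, columns=["Offer"], dtype=int)

# dummy_na=True adds a separate indicator column for missing values.
with_na_dummies = pd.get_dummies(toy, columns=["Offer"], dummy_na=True, dtype=int)

print(default_dummies)
print(with_na_dummies)
```

Passing `dtype=int` yields 0/1 integers directly, which makes a separate `astype(int)` step unnecessary.
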


### Verifying Data Types

We will now check if all data types in the DataFrame are either `float` or `int`, except for the columns that were originally of `object` data type. This step ensures that our data is ready for machine learning algorithms that require numerical input.


In [108]:
# Check if all data types in the DataFrame are either float or int, except for the columns that were originally of object data type
non_numeric_columns = df_encoded.select_dtypes(exclude=['float64', 'int64']).columns
non_numeric_columns = [col for col in non_numeric_columns if col not in object_columns]

if len(non_numeric_columns) == 0:
    print("All columns are either float or int, except for the original object columns.")
else:
    print("The following columns are not float or int and were not originally object columns:")
    print(non_numeric_columns)

All columns are either float or int, except for the original object columns.
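
The check above matches only `int64` and `float64` exactly. A slightly more general sketch, assuming any numeric width is acceptable, uses `pandas.api.types.is_numeric_dtype`; the frame `toy` is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame with one non-numeric column.
toy = pd.DataFrame({
    "Age": [37, 46],
    "City": ["Glendale", "Martinez"],
    "Latitude": [34.16, 38.01],
})

# Collect every column whose dtype is not numeric.
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(non_numeric)
```
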


## 4.3 Converting the Object Data Types

In [109]:
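# Save an intermediate copy of the encoded dataset (the final save at the end of this notebook overwrites it)
df_encoded.to_csv('../2_data/telcocustomerchurn_encoded.csv', index=False)

# List the object columns that still need handling
print("Object Columns (object):")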
for col in object_columns:
    print(f" - {col}")

Object Columns (object):
 - Customer ID
 - Location ID
 - City
 - Lat Long
 - Service ID
 - Status ID


In [110]:
print(df[object_columns].head())

  Customer ID Location ID          City                Lat Long  \
0  0002-ORFBO  FUGQUJ6597  Frazier Park  34.827662, -118.999073   
1  0003-MKNFE  SIZFEJ5344      Glendale  34.162515, -118.203869   
2  0004-TLHLJ  RZDAXJ8786    Costa Mesa  33.645672, -117.922613   
3  0011-IGKFF  MGKGVM9555      Martinez  38.014457, -122.115432   
4  0013-EXCHZ  FJLSME1564     Camarillo  34.227846, -119.079903   

      Service ID   Status ID  
0  MJBAXYDAX5462  UAAWUJ8685  
1  NICWXTOGG9486  URNYXG9268  
2  DCSKWRXAI3251  LOOUCZ6174  
3  ZEOATALAE9483  HDYLOW1467  
4  MVMZRJAHU9423  EICWUI5128  


### Documentation

1. **Customer ID**: 
    - **Action**: Drop
    - **Reason**: This column is unique for each entry and does not provide additional value for analysis.

2. **Location ID**: 
    - **Action**: Drop
    - **Reason**: This column is a unique, opaque identifier and provides no additional value for analysis.

3. **City**: 
    - **Action**: Transform
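    - **Reason**: City is a text column with many distinct values. It is label-encoded into integer codes so that it can be used as numerical input; the codes are arbitrary and carry no meaningful order.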

4. **Lat Long**: 
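    - **Action**: Split into `Lat` and `Long`, then compare with `Latitude` and `Longitude`
    - **Reason**: If `Lat` and `Long` match the existing `Latitude` and `Longitude` columns, `Lat Long` is redundant and is dropped together with the two helper columns. Otherwise, only `Lat Long` is dropped and `Lat` and `Long` are kept.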

5. **Service ID**: 
    - **Action**: Drop
    - **Reason**: This column is unique for each entry and does not provide additional value for analysis.

6. **Status ID**: 
    - **Action**: Drop
    - **Reason**: This column is unique for each entry and does not provide additional value for analysis.

In [111]:
# Drop columns that are unique for each entry and do not provide additional value for analysis
df_encoded.drop(columns=['Customer ID', 'Location ID', 'Service ID', 'Status ID'], inplace=True)

# Transform the 'City' column using a method suitable for text data transformation
# For simplicity, let's use Label Encoding for the 'City' column
df_encoded['City'] = le.fit_transform(df_encoded['City'])

# Split 'Lat Long' into 'Lat' and 'Long'
df_encoded[['Lat', 'Long']] = df_encoded['Lat Long'].str.split(', ', expand=True).astype(float)

# Check if 'Lat' and 'Long' are the same as 'Latitude' and 'Longitude' columns
if df_encoded['Lat'].equals(df_encoded['Latitude']) and df_encoded['Long'].equals(df_encoded['Longitude']):
    df_encoded.drop(columns=['Lat Long', 'Lat', 'Long'], inplace=True)
else:
    df_encoded.drop(columns=['Lat Long'], inplace=True)

print(df_encoded.head())

   Unnamed: 0  Count  Gender  Age  Under 30  Senior Citizen  Married  \
0        4006      1       0   37         0               0        1   
1        4788      1       1   46         0               0        0   
2        1901      1       1   50         0               0        0   
3         395      1       1   78         0               1        1   
4         368      1       0   75         0               1        1   

   Dependents  Number of Dependents  City  Zip Code   Latitude   Longitude  \
0           0                     0   346     93225  34.827662 -118.999073   
1           0                     0   368     91206  34.162515 -118.203869   
2           0                     0   222     92627  33.645672 -117.922613   
3           0                     0   587     94553  38.014457 -122.115432   
4           0                     0   139     93010  34.227846 -119.079903   

   Referred a Friend  Number of Referrals  Tenure in Months  Phone Service  \
0                  1
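
The cell above compares the split coordinates with `Series.equals`, which requires exact equality. If the parsed strings and the stored floats differ by rounding, that check fails and the helper columns are kept. A tolerance-based variant could look like the following sketch; the two rows are copied from the `Lat Long` preview above, and `np.allclose` with its default tolerances is an assumption, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Two rows taken from the 'Lat Long' preview shown earlier.
toy = pd.DataFrame({
    "Lat Long": ["34.827662, -118.999073", "34.162515, -118.203869"],
    "Latitude": [34.827662, 34.162515],
    "Longitude": [-118.999073, -118.203869],
})

# Split the combined string into two float columns.
parts = toy["Lat Long"].str.split(", ", expand=True).astype(float)
parts.columns = ["Lat", "Long"]

# Compare with a small tolerance instead of exact equality.
is_redundant = bool(
    np.allclose(parts["Lat"], toy["Latitude"])
    and np.allclose(parts["Long"], toy["Longitude"])
)

if is_redundant:
    toy = toy.drop(columns=["Lat Long"])

print(is_redundant)
print(toy.columns.tolist())
```
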

In [112]:
# Check if all columns in df_encoded are either int or float
non_numeric_columns = df_encoded.select_dtypes(exclude=['int64', 'float64']).columns

if len(non_numeric_columns) == 0:
    print("All columns in df_encoded are either int or float.")
else:
    print("The following columns in df_encoded are not int or float:")
    print(non_numeric_columns)

All columns in df_encoded are either int or float.


In [113]:
print(df_encoded.dtypes)

Unnamed: 0                                                  int64
Count                                                       int64
Gender                                                      int64
Age                                                         int64
Under 30                                                    int64
Senior Citizen                                              int64
Married                                                     int64
Dependents                                                  int64
Number of Dependents                                        int64
City                                                        int64
Zip Code                                                    int64
Latitude                                                  float64
Longitude                                                 float64
Referred a Friend                                           int64
Number of Referrals                                         int64
Tenure in 

In [114]:
# Save the encoded dataset to a CSV file
df_encoded.to_csv('../2_data/telcocustomerchurn_encoded.csv', index=False)
print("DataFrame saved as 'telcocustomerchurn_encoded.csv'")

DataFrame saved as 'telcocustomerchurn_encoded.csv'
