The dataset contains information about customers of a telecommunications
company and whether they have churned (i.e., discontinued their services). The dataset
includes various attributes of the customers, such as their demographics, usage patterns, and
account information. The goal is to perform data cleaning and preparation to gain insights
into the factors that contribute to customer churn.
Tasks to Perform:
1. Import the "telecom_customer_churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations,
and standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to
predicting customer churn.
9. Normalize or scale the data if necessary.
10. Split the dataset into training and testing sets for further analysis.
11. Export the cleaned dataset for future analysis or modeling.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("telecom_customer_churn.csv")
df

,Customer ID,Gender,Age,Married,Number of Dependents,City,Zip Code,Latitude,Longitude,Number of Referrals,...,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Customer Status,Churn Category,Churn Reason
0,0002-ORFBO,Female,37,Yes,0,Frazier Park,93225,34.827662,-118.999073,2,...,Credit Card,65.60,593.30,0.00,0,381.51,974.81,Stayed,,
1,0003-MKNFE,Male,46,No,0,Glendale,91206,34.162515,-118.203869,0,...,Credit Card,-4.00,542.40,38.33,10,96.21,610.28,Stayed,,
2,0004-TLHLJ,Male,50,No,0,Costa Mesa,92627,33.645672,-117.922613,0,...,Bank Withdrawal,73.90,280.85,0.00,0,134.60,415.45,Churned,Competitor,Competitor had better devices
3,0011-IGKFF,Male,78,Yes,0,Martinez,94553,38.014457,-122.115432,1,...,Bank Withdrawal,98.00,1237.85,0.00,0,361.66,1599.51,Churned,Dissatisfaction,Product dissatisfaction
4,0013-EXCHZ,Female,75,Yes,0,Camarillo,93010,34.227846,-119.079903,3,...,Credit Card,83.90,267.40,0.00,0,22.14,289.54,Churned,Dissatisfaction,Network reliability
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,9987-LUTYD,Female,20,No,0,La Mesa,91941,32.759327,-116.997260,0,...,Credit Card,55.15,742.90,0.00,0,606.84,1349.74,Stayed,,
7039,9992-RRAMN,Male,40,Yes,0,Riverbank,95367,37.734971,-120.954271,1,...,Bank Withdrawal,85.10,1873.70,0.00,0,356.40,2230.10,Churned,Dissatisfaction,Product dissatisfaction
7040,9992-UJOEL,Male,22,No,0,Elk,95432,39.108252,-123.645121,0,...,Credit Card,50.30,92.75,0.00,0,37.24,129.99,Joined,,
7041,9993-LHIEB,Male,21,Yes,0,Solana Beach,92075,33.001813,-117.263628,5,...,Credit Card,67.85,4627.65,0.00,0,142.04,4769.69,Stayed,,


In [2]:
print(df.head())  # View the first few rows
print(df.info())  # Get information about the dataset
print(df.describe())  # Descriptive statistics

  Customer ID  Gender  Age Married  Number of Dependents          City  \
0  0002-ORFBO  Female   37     Yes                     0  Frazier Park   
1  0003-MKNFE    Male   46      No                     0      Glendale   
2  0004-TLHLJ    Male   50      No                     0    Costa Mesa   
3  0011-IGKFF    Male   78     Yes                     0      Martinez   
4  0013-EXCHZ  Female   75     Yes                     0     Camarillo   

   Zip Code   Latitude   Longitude  Number of Referrals  ...   Payment Method  \
0     93225  34.827662 -118.999073                    2  ...      Credit Card   
1     91206  34.162515 -118.203869                    0  ...      Credit Card   
2     92627  33.645672 -117.922613                    0  ...  Bank Withdrawal   
3     94553  38.014457 -122.115432                    1  ...  Bank Withdrawal   
4     93010  34.227846 -119.079903                    3  ...      Credit Card   

  Monthly Charge Total Charges  Total Refunds Total Extra Data Charg
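
Before choosing a cleaning strategy, it can help to quantify how much data is missing per column and how many rows are exact duplicates. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative stand-ins, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Customer ID": ["A", "B", "C", "D"],
    "Monthly Charge": [65.6, -4.0, np.nan, 98.0],
    "Churn Category": [None, None, "Competitor", "Dissatisfaction"],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
print(missing)

# Number of rows that are exact copies of an earlier row
n_dupes = int(toy.duplicated().sum())
print("Duplicate rows:", n_dupes)
```

The same two checks can be run on `df` directly once it is loaded.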

In [3]:
df.fillna(0, inplace=True)  # Fill all missing values with 0; this also replaces missing text fields such as 'Churn Category' with 0
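
Filling every column with 0 mixes numbers into text columns. One possible alternative, sketched here on a made-up frame, is to impute numeric columns with their median and text columns with a placeholder label. The label `"Unknown"` and the choice of the median are assumptions for illustration, not the strategy used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Monthly Charge": [65.6, np.nan, 73.9, 98.0],
    "Payment Method": ["Credit Card", None, "Bank Withdrawal", "Credit Card"],
    "Churn Category": [None, None, "Competitor", "Dissatisfaction"],
})

# Separate numeric columns from everything else
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Numeric columns: fill with each column's median
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
# Text columns: fill with a placeholder label (hypothetical choice)
toy[cat_cols] = toy[cat_cols].fillna("Unknown")

print(toy)
```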

In [None]:
df.drop_duplicates(inplace=True)  # Remove exact duplicate rows
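
`drop_duplicates()` without arguments only removes rows that match in every column. If `Customer ID` is meant to be unique (an assumption about this dataset), duplicates can also be checked on that key alone. A small sketch with made-up rows:

```python
import pandas as pd

# Toy frame with one repeated customer ID (illustrative values only)
toy = pd.DataFrame({
    "Customer ID": ["0002-ORFBO", "0003-MKNFE", "0003-MKNFE", "0004-TLHLJ"],
    "Age": [37, 46, 46, 50],
})

# Show every row whose ID appears more than once, for inspection
dupe_ids = toy[toy.duplicated(subset="Customer ID", keep=False)]
print(dupe_ids)

# Keep the first occurrence of each ID and renumber the index
deduped = toy.drop_duplicates(subset="Customer ID", keep="first").reset_index(drop=True)
print(deduped)
```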

In [None]:
df['Gender'] = df['Gender'].str.lower()  # Example: Standardize gender to lowercase
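
Lowercasing handles case differences but not stray whitespace or abbreviations. The sketch below adds whitespace stripping and a small mapping of variants; the variants `"f"` and `"m"` are hypothetical examples, not values observed in this dataset:

```python
import pandas as pd

# Toy column with inconsistent formatting (made-up values)
toy = pd.DataFrame({"Gender": [" Female", "male", "MALE ", "F", "female"]})

# Strip surrounding whitespace, then lowercase
cleaned = toy["Gender"].str.strip().str.lower()
# Map known abbreviations to the canonical spelling (hypothetical variants)
cleaned = cleaned.replace({"f": "female", "m": "male"})

toy["Gender"] = cleaned
print(toy["Gender"].value_counts())
```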

In [None]:
df['TotalCharges'] = pd.to_numeric(df['Total Charges'], errors='coerce')  # Convert TotalCharges to numeric
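
With `errors='coerce'`, values that cannot be parsed are silently turned into NaN, so it is worth counting how many were affected. A minimal sketch on a made-up series (the bad entries `"n/a"` and `"unknown"` are illustrative):

```python
import pandas as pd

# Toy column stored as text, with two unparseable entries (made-up values)
raw = pd.Series(["593.30", "542.40", "n/a", "280.85", "unknown"], name="Total Charges")

# Unparseable strings become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present before conversion but are NaN afterwards
n_failed = int(numeric.isna().sum() - raw.isna().sum())
print(numeric)
print("Values that failed to parse:", n_failed)
```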

In [None]:
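# Remove outliers from the TotalCharges column (keep rows within 3 standard deviations of the mean)
z_scores = (df['TotalCharges'] - df['TotalCharges'].mean()) / df['TotalCharges'].std()
df = df[z_scores.abs() < 3].copy()  # .copy() avoids SettingWithCopyWarning in later assignments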
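
The z-score rule depends on the mean and standard deviation, which are themselves pulled by extreme values. An alternative that is sometimes used is the interquartile-range (IQR) rule; the sketch below applies it to a made-up series with one obviously extreme value. The 1.5 multiplier is the conventional choice, not something specified in this notebook:

```python
import pandas as pd

# Toy series with one extreme value (made-up data)
charges = pd.Series(
    [593.30, 542.40, 280.85, 1237.85, 267.40, 742.90, 50000.0],
    name="Total Charges",
)

# IQR rule: keep values within 1.5 * IQR of the quartiles
q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = charges[charges.between(lower, upper)]
print(filtered)

# For comparison: z-scores on the same data
z = (charges - charges.mean()) / charges.std()
print(z.abs().max())
```

On a sample this small the z-score of the extreme value stays below 3, so the z-score rule would keep it while the IQR rule drops it.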

In [None]:
# Create a new feature: tenure in days, approximating one month as 30 days
df['TenureinDays'] = df['Tenure in Months'] * 30
df

In [None]:
scaler = StandardScaler()  # standardize the numeric columns (zero mean, unit variance)
df[['MonthlyCharge', 'TotalCharges', 'TenureinDays']] = scaler.fit_transform(df[['Monthly Charge', 'TotalCharges', 'TenureinDays']])
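
Fitting the scaler on the full dataset lets information from the future test rows influence the scaling. A common way to avoid this is to split first, fit the scaler on the training rows only, and reuse it on the test rows. A minimal sketch on made-up values:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Monthly Charge": [65.6, 73.9, 98.0, 83.9, 55.15, 85.1, 50.3, 67.85],
    "Tenure in Months": [9.0, 4.0, 13.0, 3.0, 13.0, 22.0, 2.0, 67.0],
})
cols = ["Monthly Charge", "Tenure in Months"]

# Split first, then work on independent copies
train, test = train_test_split(toy, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
train[cols] = scaler.fit_transform(train[cols])  # learn mean/std from training rows only
test[cols] = scaler.transform(test[cols])        # apply the same mean/std to test rows

print(train)
print(test)
```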

In [None]:
# Split data into features (X) and target (y)
X = df.drop('Churn Category', axis=1)
y = df['Churn Category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80/20 train/test split
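
If the aim is to predict whether a customer churned, one option is to derive a binary target from `Customer Status` and to drop columns that are only filled in after a customer has churned. The sketch below assumes that reading of the columns and uses a made-up frame; the name `churned` is hypothetical. `stratify=y` keeps the class proportions similar in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Age": [37, 46, 50, 78, 75, 20, 40, 22, 21, 33],
    "Customer Status": ["Stayed", "Stayed", "Churned", "Churned", "Churned",
                        "Stayed", "Churned", "Joined", "Stayed", "Stayed"],
    "Churn Category": [None, None, "Competitor", "Dissatisfaction", "Dissatisfaction",
                       None, "Dissatisfaction", None, None, None],
})

# Binary target: 1 if the customer churned, 0 otherwise (hypothetical encoding)
y = (toy["Customer Status"] == "Churned").astype(int).rename("churned")

# Drop the status itself and columns only known after churn, to avoid leaking the answer
X = toy.drop(columns=["Customer Status", "Churn Category"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.value_counts())
print(y_test.value_counts())
```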

In [None]:
df.to_csv("Cleaned_Telecom_Customer_Churn.csv", index=False)  # export without the index so reloading does not add an 'Unnamed: 0' column

In [4]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv("telecom_customer_churn.csv")

# View the first few rows and get dataset info
print("First 5 Rows of Data:\n", df.head())
print("\nDataset Information:")
df.info()  # info() prints its report directly and returns None
print("\nDescriptive Statistics:\n", df.describe())

# Convert 'Total Charges' to numeric first; unparseable values become NaN
df['TotalCharges'] = pd.to_numeric(df['Total Charges'], errors='coerce')

# Handle missing values by filling NaNs with 0 (including any NaNs produced by the conversion above)
df.fillna(0, inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Standardize 'Gender' column to lowercase
df['Gender'] = df['Gender'].str.lower()

# Remove outliers from 'TotalCharges' using Z-score method
z_scores = (df['TotalCharges'] - df['TotalCharges'].mean()) / df['TotalCharges'].std()
df = df[z_scores.abs() < 3].copy()  # copy so the column assignments below modify an independent frame

# Create a new feature: tenure in days, approximating one month as 30 days
df['TenureinDays'] = df['Tenure in Months'] * 30

# Initialize the StandardScaler to normalize data
scaler = StandardScaler()

# Apply scaling to selected columns
df[['MonthlyCharge', 'TotalCharges', 'TenureinDays']] = scaler.fit_transform(df[['Monthly Charge', 'TotalCharges', 'TenureinDays']])

# Split data into features (X) and target (y)
X = df.drop('Churn Category', axis=1)
y = df['Churn Category']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Export the cleaned and processed dataset to a new CSV file
df.to_csv("Cleaned_Telecom_Customer_Churn.csv", index=False)
print("\nCleaned dataset saved as 'Cleaned_Telecom_Customer_Churn.csv'")
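
Multiplying tenure by a constant adds no new information for a model. Features that combine columns may be more useful; the sketch below derives two such features from columns shown in the dataset preview. The new column names `Revenue per Month` and `Has Referrals` are hypothetical, and the values are made up:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Tenure in Months": [9, 0, 13],
    "Total Revenue": [974.81, 129.99, 1599.51],
    "Number of Referrals": [2, 0, 1],
})

# Average revenue per month of tenure; a tenure of 0 is masked to NaN
# so the division yields NaN instead of infinity
tenure = toy["Tenure in Months"].replace(0, np.nan)
toy["Revenue per Month"] = toy["Total Revenue"] / tenure

# Flag customers who referred at least one other customer
toy["Has Referrals"] = toy["Number of Referrals"] > 0

print(toy)
```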


First 5 Rows of Data:
   Customer ID  Gender  Age Married  Number of Dependents          City  \
0  0002-ORFBO  Female   37     Yes                     0  Frazier Park   
1  0003-MKNFE    Male   46      No                     0      Glendale   
2  0004-TLHLJ    Male   50      No                     0    Costa Mesa   
3  0011-IGKFF    Male   78     Yes                     0      Martinez   
4  0013-EXCHZ  Female   75     Yes                     0     Camarillo   

   Zip Code   Latitude   Longitude  Number of Referrals  ...   Payment Method  \
0     93225  34.827662 -118.999073                    2  ...      Credit Card   
1     91206  34.162515 -118.203869                    0  ...      Credit Card   
2     92627  33.645672 -117.922613                    0  ...  Bank Withdrawal   
3     94553  38.014457 -122.115432                    1  ...  Bank Withdrawal   
4     93010  34.227846 -119.079903                    3  ...      Credit Card   

  Monthly Charge Total Charges  Total Refunds