# 📊 Customer Churn Behavior in Telecom

| Name | Student ID |
|------|-------------|
| **Ali Refaat** | 58-4020 |
| **Ahmed Hassan** | 58-0671 |
| **Omar Sherif** | 58-1335 |
| **Ziad Ekramy** | 58-6936 |


## References 
The CSV file is zipped with the notebook and the report. Their URLs are also listed here.

Dataset: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

Report: https://docs.google.com/document/d/1Glo1EXuVO4g1ExZ9Jak4fstznNyhMnjftuZWBc3Rrbs/edit?usp=sharing

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv("Telco-Customer.csv")

df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## Analysis & Preprocessing of Input Attributes

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [3]:
df.shape

(7043, 21)

There is a problem with the column TotalCharges: it is stored as an object, but it should be a float.
The two statements in the next cell show why: 11 rows contain a blank string (`' '`) instead of a number.

In [4]:
print("Unique entries in TotalCharges that are not numeric:")
print(df[pd.to_numeric(df["TotalCharges"], errors="coerce").isna()]["TotalCharges"].unique())


print("\nNumber of rows with non-numeric TotalCharges:")
print(df[pd.to_numeric(df["TotalCharges"], errors="coerce").isna()].shape[0])

Unique entries in TotalCharges that are not numeric:
[' ']

Number of rows with non-numeric TotalCharges:
11


We will drop these 11 rows

In [5]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
print("NaNs in TotalCharges after conversion:", df["TotalCharges"].isna().sum())
df = df.dropna(subset=["TotalCharges"])

NaNs in TotalCharges after conversion: 11


In [6]:
df.shape #confirms the removal of 11 rows

(7032, 21)

df.dtypes shows that we have all the numeric columns we need: 1) SeniorCitizen 2) tenure 3) MonthlyCharges 4) TotalCharges

In [7]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

We will now encode the categorical columns, and drop customerID

In [8]:
df = df.drop("customerID", axis=1)

In [9]:
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()

categorical_cols, numeric_cols


(['gender',
  'Partner',
  'Dependents',
  'PhoneService',
  'MultipleLines',
  'InternetService',
  'OnlineSecurity',
  'OnlineBackup',
  'DeviceProtection',
  'TechSupport',
  'StreamingTV',
  'StreamingMovies',
  'Contract',
  'PaperlessBilling',
  'PaymentMethod',
  'Churn'],
 ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges'])

In [10]:
categorical_cols.remove("Churn")

In [11]:
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

In [12]:
df_encoded.columns

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn',
       'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No internet service', 'OnlineBackup_Yes',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No internet service', 'StreamingMovies_Yes',
       'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check'],
      dtype='object')
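To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its four rows are made up for illustration; only the `Contract` category names come from the dataset. The first category in sorted order (`Month-to-month`) gets no column of its own and is represented by all remaining dummies being false.

```python
import pandas as pd

# Hypothetical toy frame reusing the three Contract categories from the dataset
toy = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "One year"]}
)

# drop_first=True removes the first category (in sorted order) to avoid a redundant column
toy_encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

This is why the encoded column list above contains `Contract_One year` and `Contract_Two year` but no `Contract_Month-to-month`.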

We will now scale numeric columns

In [13]:
numeric_features_to_scale = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]

In [14]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_encoded[numeric_features_to_scale] = scaler.fit_transform(df_encoded[numeric_features_to_scale])
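The cell above fits the scaler on the full dataframe, before the train/test split. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse those statistics for the test rows, so that no information from the test set enters preprocessing. The arrays `X_demo` and `y_demo` are made-up stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric columns (assumption, not the Telco data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # mean and std come from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training statistics

print("Train means after scaling:", X_tr_scaled.mean(axis=0))
print("Test means after scaling:", X_te_scaled.mean(axis=0))
```

The training means are zero up to floating-point error, while the test means are only close to zero, which is expected because the test rows did not contribute to the fitted statistics.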

Here we check what we did in the previous blocks

In [15]:
print("===== DATAFRAME STATUS CHECK =====\n")

print("1) First 5 rows:")
display(df_encoded.head())

print("\n2) Shape of encoded dataframe:")
print(df_encoded.shape)

print("\n3) Columns and types:")
print(df_encoded.dtypes)

print("\n4) Checking if 'Churn' is still present (should be here, not encoded yet):")
print("Churn in df_encoded:", "Churn" in df_encoded.columns)

print("\n5) Checking if any Churn dummy columns were accidentally created:")
churn_related = [col for col in df_encoded.columns if "Churn" in col and col != "Churn"]
print("Churn-related columns (should be empty):", churn_related)

print("\n6) Checking if all categorical columns were encoded correctly:")
encoded_object_columns = df_encoded.select_dtypes(include=['object']).columns.tolist()
print("Object-type columns (should contain ONLY 'Churn'):", encoded_object_columns)

print("\n7) Checking numeric scaling (mean should be near 0):")
for col in ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]:
    mean_val = df_encoded[col].mean()
    print(f"{col}: mean = {mean_val:.4f}")

print("\n8) Checking for missing values after preprocessing (should be 0):")
print(df_encoded.isna().sum().sum(), "missing values")

print("\n===== STATUS CHECK COMPLETE =====")


===== DATAFRAME STATUS CHECK =====

1) First 5 rows:


| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | Churn | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | ... | StreamingTV_No internet service | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.440327 | -1.280248 | -1.161694 | -0.994194 | No | False | True | False | False | True | ... | False | False | False | False | False | False | True | False | True | False |
| 1 | -0.440327 | 0.064303 | -0.260878 | -0.17374 | No | True | False | False | True | False | ... | False | False | False | False | True | False | False | False | False | True |
| 2 | -0.440327 | -1.239504 | -0.363923 | -0.959649 | Yes | True | False | False | True | False | ... | False | False | False | False | False | False | True | False | False | True |
| 3 | -0.440327 | 0.512486 | -0.74785 | -0.195248 | No | True | False | False | False | True | ... | False | False | False | False | True | False | False | False | False | False |
| 4 | -0.440327 | -1.239504 | 0.196178 | -0.940457 | Yes | False | False | False | True | False | ... | False | False | False | False | False | False | True | False | True | False |



2) Shape of encoded dataframe:
(7032, 31)

3) Columns and types:
SeniorCitizen                            float64
tenure                                   float64
MonthlyCharges                           float64
TotalCharges                             float64
Churn                                     object
gender_Male                                 bool
Partner_Yes                                 bool
Dependents_Yes                              bool
PhoneService_Yes                            bool
MultipleLines_No phone service              bool
MultipleLines_Yes                           bool
InternetService_Fiber optic                 bool
InternetService_No                          bool
OnlineSecurity_No internet service          bool
OnlineSecurity_Yes                          bool
OnlineBackup_No internet service            bool
OnlineBackup_Yes                            bool
DeviceProtection_No internet service        bool
DeviceProtection_Yes                        bool
Tec

## Analysis of the Output Attributes (Churn)

In [16]:
df_encoded["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

In [17]:
print("Unique Churn values:", df_encoded["Churn"].unique()) #check unique values after encoding

Unique Churn values: [0 1]


In [18]:
print("Churn counts:")
print(df_encoded["Churn"].value_counts()) 
print("==================================")
print("Churn percentages:")
print(df_encoded["Churn"].value_counts(normalize=True) * 100)

Churn counts:
Churn
0    5163
1    1869
Name: count, dtype: int64
Churn percentages:
Churn
0    73.421502
1    26.578498
Name: proportion, dtype: float64


In [19]:
numeric_cols = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]

correlations = {}
for col in numeric_cols:
    correlations[col] = np.corrcoef(df_encoded[col], df_encoded["Churn"])[0, 1]

print("Correlation with Churn:")
for k, v in correlations.items():
    print(f"{k}: {v:.4f}")


Correlation with Churn:
SeniorCitizen: 0.1505
tenure: -0.3540
MonthlyCharges: 0.1929
TotalCharges: -0.1995


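From the output analysis, we now know that:
1) The output is imbalanced: about 73.4% of customers stay while 26.6% leave
2) Churn is weakly positively correlated with being a senior citizen (0.15) and with higher monthly charges (0.19)
3) Churn is less common among customers who have stayed longer: tenure has the strongest correlation (-0.35), and TotalCharges, which accumulates over time, is also negatively correlated (-0.20)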
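Because the classes are imbalanced, it helps to know what accuracy a trivial model would reach. The sketch below uses only the class counts printed above and computes the accuracy of always predicting the majority class on the full dataset; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
n_stay, n_churn = 5163, 1869

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(n_stay, n_churn) / (n_stay + n_churn)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")  # 0.7342
```

Any classifier trained on this data should be judged against this figure rather than against 50%.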

## Support Vector Machine Implementation

In [20]:
# Separate output variable (Churn) and input features
y = df_encoded["Churn"].astype(int)  # must be int
X = df_encoded.drop("Churn", axis=1).astype(float)  # all numeric


y_binary = np.where(y == 0, -1, 1)  # change Churn from {0, 1} → {-1, +1}

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X.values,            # convert to numpy array
    y_binary,            # numpy array
    test_size=0.2, 
    random_state=123
)
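The split above is purely random. With imbalanced labels, one option is to pass `stratify` to `train_test_split` so that both parts keep roughly the same churn ratio. The following is a sketch on synthetic labels with a churn share close to the one observed here, not a change to the notebook's own split; `X_demo` and `y_demo` are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 26.6% positives (assumption for illustration)
y_demo = np.array([1] * 266 + [-1] * 734)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=123,
    stratify=y_demo,  # keep the class ratio in both splits
)

print("Churn share in train:", (y_tr == 1).mean())
print("Churn share in test:", (y_te == 1).mean())
```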

In [22]:
class SVM:
    def __init__(self, learning_rate=0.001, lambda_param=0.01, n_iters=1000):
        self.lr = learning_rate
        self.lambda_param = lambda_param
        self.n_iters = n_iters
        self.w = None
        self.b = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y_ = np.where(y <= 0, -1, 1)

        self.w = np.zeros(n_features)
        self.b = 0

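        for _ in range(self.n_iters):
            for idx, x_i in enumerate(X):
                # Margin check: is the sample on the correct side with margin >= 1?
                condition = y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1

                if condition:
                    # Margin satisfied: only the L2 regularization term contributes
                    self.w -= self.lr * (2 * self.lambda_param * self.w)
                else:
                    # Margin violated: add the hinge-loss sub-gradient for w and b
                    self.w -= self.lr * (2 * self.lambda_param * self.w - y_[idx] * x_i)
                    self.b -= self.lr * y_[idx]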

    def predict(self, X):
        approx = np.dot(X, self.w) - self.b
        return np.sign(approx)
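The update rules in `fit` are consistent with per-sample sub-gradient descent on a regularized hinge loss. The notebook does not state the objective explicitly, so the formulation below is a reading of the code, with $\lambda$ standing for `lambda_param` and the decision function written as $w \cdot x_i - b$ to match `predict`.

```latex
J_i(w, b) = \lambda \lVert w \rVert^2 + \max\bigl(0,\; 1 - y_i (w \cdot x_i - b)\bigr)

\frac{\partial J_i}{\partial w} =
\begin{cases}
2 \lambda w & \text{if } y_i (w \cdot x_i - b) \ge 1 \\
2 \lambda w - y_i x_i & \text{otherwise}
\end{cases}
\qquad
\frac{\partial J_i}{\partial b} =
\begin{cases}
0 & \text{if } y_i (w \cdot x_i - b) \ge 1 \\
y_i & \text{otherwise}
\end{cases}
```

Each step then applies $w \leftarrow w - \eta \, \partial J_i / \partial w$ and $b \leftarrow b - \eta \, \partial J_i / \partial b$, where $\eta$ is `learning_rate`.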

In [23]:
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

In [24]:
clf = SVM()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)

print("SVM classification accuracy:", accuracy(y_test, predictions))


SVM classification accuracy: 0.8095238095238095


## Comparison between Predicted & Actual Outputs

In [29]:
predictions = clf.predict(X_test)


svm_accuracy = accuracy(y_test, predictions)
print("Predicted labels (first 30):")
print(predictions[:30])

print("\nTrue labels (first 30):")
print(y_test[:30])

print("\nSVM Classification Accuracy:", svm_accuracy)


comparison = np.vstack((y_test[:30], predictions[:30])).T
print("\nComparison Table (first 30 rows):")
print("Format: [True Value, Predicted Value]")
print(comparison)

Predicted labels (first 30):
[-1. -1. -1. -1. -1. -1.  1. -1. -1.  1. -1. -1. -1. -1.  1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]

True labels (first 30):
[-1 -1 -1 -1  1 -1  1 -1 -1  1 -1  1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1  1 -1 -1  1]

SVM Classification Accuracy: 0.8095238095238095

Comparison Table (first 30 rows):
Format: [True Value, Predicted Value]
[[-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [ 1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [-1. -1.]
 [ 1. -1.]
 [-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [-1. -1.]
 [ 1. -1.]
 [-1. -1.]
 [-1. -1.]
 [ 1. -1.]]


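1) The SVM reaches an accuracy of about 81% on the test set, compared with the 73.4% share of the majority class (no churn) in the full dataset
2) In the previous block, we displayed the true values vs. the predicted values; in the first 30 rows, the model correctly flags 3 of the 7 churners and raises no false alarms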
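Accuracy alone hides how the errors are distributed between the two classes. As a small illustration, the sketch below counts the confusion-matrix cells for the 30 label pairs printed above. It covers only those 30 rows, not the full test set, and the variable names are illustrative.

```python
import numpy as np

# First 30 true and predicted labels, copied from the printed output above
y_true_30 = np.array([
    -1, -1, -1, -1,  1, -1,  1, -1, -1,  1,
    -1,  1, -1, -1,  1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1,  1, -1, -1,  1,
])
y_pred_30 = np.array([
    -1, -1, -1, -1, -1, -1,  1, -1, -1,  1,
    -1, -1, -1, -1,  1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
])

# Confusion-matrix cells, treating +1 (churn) as the positive class
tp = int(np.sum((y_true_30 == 1) & (y_pred_30 == 1)))
fn = int(np.sum((y_true_30 == 1) & (y_pred_30 == -1)))
fp = int(np.sum((y_true_30 == -1) & (y_pred_30 == 1)))
tn = int(np.sum((y_true_30 == -1) & (y_pred_30 == -1)))

recall = tp / (tp + fn)                                   # share of churners that were caught
precision = tp / (tp + fp) if (tp + fp) else float("nan")  # share of churn predictions that were right

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=3 FN=4 FP=0 TN=23
print(f"Recall={recall:.3f} Precision={precision:.3f}")
```

On this small sample the model misses more than half of the churners even though 26 of 30 predictions are correct, which is the typical pattern when the majority class dominates.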