#**Introduction**
-------

To retain treasured subscribers in state-of-the-art aggressive market, telecom organizations have to be able to are expecting purchaser attrition. This be counted is tackled by way of this venture utilising the data dataset from Kaggle. Customer facts and attrition indicators are contained on this huge dataset. The study will assemble a machine learning pipeline that predicts the likelihood of turnover primarily based on an evaluation of consumer behaviour to be able to fight attrition. Potential benefits of identifying at-hazard consumers include the implementation of centered retention strategies and a discount in customer turnover.

** Dataset Link: https://www.kaggle.com/datasets/blastchar/telco-customer-churn **


#**Problem Statement**
-----


Customer churn, the loss of subscribers, is a major financial concern for telecom companies. The motive of this endeavour is to create a churn prediction version in order to deter clients from leaving.  Prolonged patron attrition costs considerably hinder growth and sales, necessitating ongoing tasks to gather sparkling clients.  As stated by Zhang et al. (2022), should the attrition prediction machine prove to be powerful, the employer might also favor to put into effect centered retention strategies, which includes stronger provider plans or distinct reductions, to draw and maintain purchasers who are at danger.  This problem may be characterized as a binary classification endeavour that uses consumer information, which include demographics, invoicing records, and provider utilization styles.  The predictive model determines whether a user will remain churn (1) or remain a subscriber (0).  This in the end fosters customer loyalty and pride through allowing the agency to allocate assets strategically and personalize approaches to client retention.

#**Data Exploration**
----


In [1]:
import pandas as pd
# Load the dataset
data = pd.read_csv('data.csv')

# Describe the dataset
print(data.info())
print(data.describe())
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Check for outliers (assuming numerical columns)
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    # Calculate the IQR (Interquartile Range)
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1

    # Check for outliers using the IQR method
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = ((data[col] < lower_bound) | (data[col] > upper_bound)).sum()
    print(f"Outliers in {col}: {outliers}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


The chosen dataset, data, is to be had to be used on Kaggle and accommodates consumer facts this is pertinent to the churn estimation of a telecommunications business enterprise. Important elements are likely to include demographics, subscription carrier plans (e.g., internet, smartphone), usage patterns (e.g., information usage, call duration), and invoice records (e.G., tenure, price technique). Anonymity detection in use pattern anomalies, evaluation of churn label distribution (probably unequal inside the case of a extra percentage of non-churned clients), and search for absent variables are all additives of information investigation. Data cleaning can also involve the removal of items which have a giant fee of missing values or the meticulous inputting of absent records (Nalatissifa and Pardede, 2021).  Depending on the diploma of imbalance, sampling techniques inclusive of oversampling or undersampling the minority magnificence (churned clients) can be required to make sure that the version learns efficiently.  In order to assess the performance of the version, class metrics consisting of accuracy, precision, recall, and the F1 rating can be hired.  Accuracy assesses average correctness, whereas precision and recall are concerned with precisely identifying non-churned and churned consumers, respectively.  The F1 score offers a rational assessment of each.

#**Data Preprocessing & Feature Engineering**
-----


In [4]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
# Convert 'TotalCharges' to numeric, replacing non-numeric values with NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Handle missing values using the mean strategy
imputer = SimpleImputer(strategy='mean')
data['TotalCharges'] = imputer.fit_transform(data[['TotalCharges']])

# Encode categorical features
encoder = OneHotEncoder(drop='first')
encoded_features = encoder.fit_transform(data[['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                                                'InternetService', 'OnlineSecurity', 'OnlineBackup',
                                                'DeviceProtection', 'TechSupport', 'StreamingTV',
                                                'StreamingMovies', 'Contract', 'PaperlessBilling',
                                                'PaymentMethod']]).toarray()
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['gender', 'Partner', 'Dependents',
                                                                              'PhoneService', 'MultipleLines',
                                                                              'InternetService', 'OnlineSecurity',
                                                                              'OnlineBackup', 'DeviceProtection',
                                                                              'TechSupport', 'StreamingTV',
                                                                              'StreamingMovies', 'Contract',
                                                                              'PaperlessBilling', 'PaymentMethod']))

# Scale numerical features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[['tenure', 'MonthlyCharges', 'TotalCharges']])
scaled_df = pd.DataFrame(scaled_features, columns=['tenure', 'MonthlyCharges', 'TotalCharges'])

# Combine the encoded and scaled features with the original dataset
data_processed = pd.concat([data.drop(columns=['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                                                'InternetService', 'OnlineSecurity', 'OnlineBackup',
                                                'DeviceProtection', 'TechSupport', 'StreamingTV',
                                                'StreamingMovies', 'Contract', 'PaperlessBilling',
                                                'PaymentMethod', 'tenure', 'MonthlyCharges', 'TotalCharges']),
                             encoded_df, scaled_df], axis=1)

Data preparation will be crucial for building a robust model. The study will  address missing values strategically: imputing numerical features with appropriate methods (mean, median) and potentially removing entries with too many missing values. Additionally, entries with a massive range of lacking values may be eliminated. In order to mitigate the effect of utilisation sample anomalies on the model, such styles can be limited or winorized. In order to numerically represent specific facts, together with provider plans for the model, one-hot encoding might be implemented (Momin et al. 2020). Through feature engineering, extra features may be delivered to a model to decorate its efficacy.  Deductions consisting of "general month-to-month spending" may be made from billing and consumption records. Additionally, classification functions for customer tenure (e.G., "much less than 12 months," "1-2 years") should be included so that you can gain a greater comprehensive information of the movements taken by means of clients who discontinue their patronage at one of a kind tiers in their lifetimes.

#**Model Training**
-----


In [8]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler


# Handle missing values
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data.dropna(inplace=True)

# Encode categorical features
X = data.drop('Churn', axis=1)
X = pd.get_dummies(X, drop_first=True)
y = data['Churn'].map({'Yes': 1, 'No': 0})

# Check the shape of X and y
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

# Check the distribution of the target variable
print("Distribution of y:")
print(y.value_counts())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Choose and justify the algorithm (Logistic Regression)
model = LogisticRegression()

# Hyperparameter tuning
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}  # Regularization parameter
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Train the model with the best hyperparameters
best_model = LogisticRegression(**best_params)
best_model.fit(X_train, y_train)

# Predict on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Best hyperparameters: {best_params}")
print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)


Shape of X: (7043, 7072)
Shape of y: (7043,)
Distribution of y:
Churn
0    5174
1    1869
Name: count, dtype: int64
Best hyperparameters: {'C': 10}
Accuracy: 0.7863733144073811
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.96      0.87      1036
           1       0.72      0.31      0.44       373

    accuracy                           0.79      1409
   macro avg       0.76      0.64      0.65      1409
weighted avg       0.78      0.79      0.75      1409



Logistic Regression for Churn Prediction:
--------------
Given that it is able to control the binary category hassle of attrition prediction and that the dataset is probable to include both numeric and express elements, logistic regression is an fantastic alternative. It accurately simulates the relationship that exists among attributes and binary consequences (Fujo et al. 2022). A linear combination of weighted attributes is utilised with the aid of Logistic Regression to compute the opportunity that a given facts point is a member of a specific magnificence (churned in this study case).

Hyperparameter Tuning for Optimal Performance:
----------------
GridSearchCV will be hired to optimise the version and acquire the maximum beneficial consequences. This method employs pass-validation to assess various model configurations through the exam of a predetermined grid of hyperparameter values ('C' being the regularisation parameter on this example). In order to decide the most advantageous version, the configuration that achieves the best accuracy at the validation set can be evaluated (Mustafa et al. 2021).

The provided code demonstrates several key steps:
-----------
* Data Preprocessing: In the method of statistics preprocessing, numerical conversion and row elimination are carried out to deal with the issue of missing values in the "TotalCharges" column when a couple of gadgets are absent.
* Feature Engineering: One-warm encoding is utilised to encode express features so that the version can comprehend them.
* Train-Test Split: The method we employ partitions the statistics into training and testing units, with an 80%/20% department as illustrated inside the technique.
* Feature Scaling: In order to maintain a uniform scale for all capabilities, the 'StandardScaler' is hired to standardise numerical features.
* Logistic Regression Model: A clean 'LogisticRegression' example is created.
* Hyperparameter Tuning: A grid seek is accomplished making use of the 'GridSearchCV' characteristic, incorporating exclusive values of the regularisation parameter 'C'. Maximum accuracy turned into received by the chosen version on the validation folds.
* Model Training: The closing version, which incorporates the most most beneficial hyperparameters, is constructed using the education set.
* Evaluation: In order to evaluate the model's performance on test facts that it has not encountered earlier than, an assessment is conducted making use of a category record and accuracy metrics.

#**Model Assessment**
----

In [11]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd

# Assuming 'data_processed' contains the preprocessed dataset

# Separate features and target variable
X = data_processed.drop('Churn', axis=1)
y = data_processed['Churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing steps for numeric and categorical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Hyperparameter tuning
param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]}  # Regularization parameter
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters for LogisticRegression
best_params_lr = grid_search.best_params_['classifier__C']

# Train the model with the best hyperparameters
best_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(C=best_params_lr))
])
best_model.fit(X_train, y_train)

# Predict on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Best hyperparameters: {best_params}")
print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

Best hyperparameters: {'classifier__C': 1}
Accuracy: 0.8218594748048261
Classification Report:
               precision    recall  f1-score   support

          No       0.86      0.90      0.88      1036
         Yes       0.69      0.60      0.64       373

    accuracy                           0.82      1409
   macro avg       0.77      0.75      0.76      1409
weighted avg       0.82      0.82      0.82      1409



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The final model's performance is evaluated on unseen test data using a combination of metrics. Accuracy (achieved score) measures the overall correct predictions. However, accuracy alone might be misleading in imbalanced datasets (Wanikar et al. 2024). A categorization file incorporates data in greater element. Recall quantifies the accuracy with which the machine identifies actual-existence churn eventualities, at the same time as precision signifies the percentage of predicted expelled consumers that during reality departed. These KPIs can be evaluated to determine the accuracy with which the model forecasts patron attrition.

#**Final Discussion**
---

This studies investigated the construction of a gadget studying infrastructure for a telecommunications employer to forecast client attrition. The Logistic Regression version, which underwent pre-processing for each categorical and numeric capabilities, demonstrated an accuracy rating of 0.79 while implemented to the checking out facts set. The categorization document encompassed supplementary information relating accuracy and recollect. Limitations persist regardless of the version's capability of identifying clients characterized by a excessive price of attrition. It is feasible for logistic regression to miss complicated relationships amongst capabilities. Furthermore, the selected metrics may be skewed if the dataset is unbalanced between instructions (with a extra share of non-churned clients).


The consequences of this study indicate that machine getting to know ought to probably be implemented as a safety measure towards attrition. By identifying at-risk customers, the company can implement targeted retention strategies.  These endeavours will be spearheaded with the aid of emphasising the informative attributes that the version has diagnosed, together with short tenancies and coffee monthly consumption.

Based on this analysis, the study recommend:
---------------------

* Refining the Model: Consider using more problematic algorithms, together with Random Forests or Gradient Boosting Machines, which can be capable of taking pictures non-linear interactions.
* Addressing Class Imbalance: In the occasion that the records well-knownshows an imbalance, strategies which include undersampling or oversampling the minority elegance (customers who have departed) may be carried out.
* Model Explainability: If the very last model selected allows it (as in Random Forests, as an example), a extra complete understanding of customer attrition behaviour may be acquired via analysing the maximum extensive attributes.

By iteratively improving the model and incorporating explainability techniques, the company can gain valuable insights to develop effective customer retention strategies and minimize churn rates.

# References
------
* Fujo, S.W., Subramanian, S. and Khder, M.A., 2022. Customer churn prediction in telecommunication industry using deep learning. Information Sciences Letters, 11(1), p.24.
* Momin, S., Bohra, T. and Raut, P., 2020. Prediction of customer churn using machine learning. In EAI International Conference on Big Data Innovation for Sustainable Cognitive Computing: BDCC 2018 (pp. 203-212). Springer International Publishing.
* Mustafa, N., Ling, L.S. and Razak, S.F.A., 2021. Customer churn prediction for telecommunication industry: A Malaysian Case Study. F1000Research, 10.
* Nalatissifa, H. and Pardede, H.F., 2021. Customer decision prediction using deep neural network on telco customer churn data. Jurnal Elektronika dan Telekomunikasi, 21(2), pp.122-127.
* Wanikar, P., Maurya, S., Vishvakarma, M., Sujatha, K., Rakesh, N., Vimal, V. and Shelke, N., 2024. Telco Customer Churn Prediction Using ML Models. International Journal of Intelligent Systems and Applications in Engineering, 12(2), pp.644-653.
* Zhang, T., Moro, S. and Ramos, R.F., 2022. A data-driven approach to improve customer churn prediction based on telecom customer segmentation. Future Internet, 14(3), p.94.
