# **ML Tests - Telco Company**


This notebook scales and encodes the pre-processed Telco data, tunes several Machine Learning classifiers with grid search, and compares the resulting metrics between models.

## Initial Settings

In [1]:
# Importing libraries
import time

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVC
from xgboost import XGBClassifier

In [2]:
# Load the pre-processed Excel file into a DataFrame
file_path = 'pre_processed.xlsx'
df = pd.read_excel(file_path)

df.head(5)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,Yes


________________________________________________________

## **Scaling and Encoding**

Scaling with `MinMaxScaler` rescales each numeric variable to the range 0 to 1, so that variables measured on different scales contribute proportionally.

`OneHotEncoder` turns each categorical variable into a set of binary indicator columns (one per category), so that the models can work with numeric input.
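To make the two transformations concrete, here is a minimal sketch on a hypothetical three-row sample (the values are illustrative, only the column names come from the dataset). It applies each transformer separately and prints the result.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical three-row sample using two column names from the dataset
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'InternetService': ['DSL', 'Fiber optic', 'DSL'],
})

# MinMaxScaler maps each value to (x - min) / (max - min)
scaled = MinMaxScaler().fit_transform(toy[['tenure']])

# OneHotEncoder creates one indicator column per category
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[['InternetService']]).toarray()

print(scaled.ravel())          # smallest tenure maps to 0, largest to 1
print(encoder.categories_[0])  # order of the indicator columns
print(encoded)
```

Inside the notebook's `ColumnTransformer`, both steps are applied in one pass and the outputs are concatenated column-wise.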

In [3]:
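# Split independent variables and target variable
# Note: customerID remains in X and is therefore one-hot encoded as a categorical feature
X = df.drop(columns=['Churn'])
# Encode the target: Churn = 'Yes' -> 1, anything else -> 0
y = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)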

# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Create a preprocessor (Scaling + Encoding)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numerical_features), 
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)  
    ])


# Split the data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Churn = Yes is the positive class (1).

Churn = No is the negative class (0).

________________________________________________________

## **Models Application**

**Metrics:**

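Train Accuracy:
- Proportion of correct predictions out of all training records. For the grid-searched models, this column reports `best_score_`, the mean cross-validated accuracy of the best parameter combination; for Naive Bayes, it is the accuracy on the full training set.

Test Accuracy:
- Proportion of correct predictions out of all test records.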

Precision:
- Proportion of true positives in all records predicted as positive.
Precision = TP / (TP + FP).

Recall:
- Proportion of true positives to the total number of records that are actually positive.
Recall = TP / (TP + FN).

F1 Score:
- Harmonic mean between precision and recall.
F1 = 2 * (Precision * Recall) / (Precision + Recall).
Useful in scenarios with imbalanced classes.

ROC AUC (Receiver Operating Characteristic - Area Under the Curve):
- Ability of the model to distinguish between classes. The closer to 1, the better; 0.5 corresponds to random guessing. It is the area under the curve that plots the true positive rate against the false positive rate at various decision thresholds.
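As a worked example of the formulas above, the following sketch computes the metrics on ten hypothetical labels (not taken from the dataset), chosen so that TP = 2, FN = 2, FP = 1 and TN = 5.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = churn (positive), 0 = no churn (negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Confusion counts for these labels: TP = 2, FN = 2, FP = 1, TN = 5
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R) = 4 / 7
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10

print(precision, recall, f1, accuracy)
```

Accuracy looks acceptable here (0.7) even though half of the actual churners are missed, which is why Recall is reported separately.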



### **Random Forest**

Combines the output of multiple decision trees to achieve a single result.

In [4]:
# Configure hyperparameters for Random Forest
rf_params = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5]
}

# Defining Pipeline 
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=51))
])

# Grid Search CV - Tuning:
rf_grid = GridSearchCV(
    estimator=rf_pipeline,
    param_grid=rf_params,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

# Start the timer (the elapsed time reported below covers tuning, prediction, and metric computation)
start_time = time.time()


# Train the model
rf_grid.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_grid.predict(X_test)

# Random Forest Metrics
metrics = {
    'Model': 'Random Forest',
    'Best Parameters': rf_grid.best_params_,
    'Train Accuracy': rf_grid.best_score_,
    'Test Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1 Score': f1_score(y_test, y_pred),
    'ROC AUC': roc_auc_score(y_test, rf_grid.predict_proba(X_test)[:, 1]),
    'Training Time (s)': time.time() - start_time
}


# Convert metrics to DataFrame
rf_results_df = pd.DataFrame([metrics])

rf_results_df




Unnamed: 0,Model,Best Parameters,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC AUC,Training Time (s)
0,Random Forest,"{'classifier__max_depth': None, 'classifier__m...",0.790061,0.790819,0.684358,0.426829,0.525751,0.842021,153.575791
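
The `classifier__` prefix in the grid keys routes each value to the pipeline step named `classifier`, and `best_score_` (reported here as Train Accuracy) is the mean accuracy over the validation folds for the best combination. A minimal sketch of both points on synthetic data, not the Telco dataset, with hypothetical `demo_*` names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: used only to keep the example self-contained)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# '<step name>__<parameter>' addresses a parameter of one pipeline step
demo_grid = GridSearchCV(
    estimator=demo_pipeline,
    param_grid={'classifier__max_depth': [2, 4], 'classifier__n_estimators': [10, 20]},
    scoring='accuracy',
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the mean validation-fold score of the best parameter combination
cv_mean = demo_grid.cv_results_['mean_test_score'][demo_grid.best_index_]
print(demo_grid.best_params_, demo_grid.best_score_, cv_mean)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation part leaks into preprocessing.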


### **XGBoost - eXtreme Gradient Boosting**

Builds an ensemble of decision trees sequentially using gradient boosting, where each new tree corrects the errors of the previous ones.

In [5]:
# Configure hyperparameters for XGBoost
xgb_params = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 6, 10]
}

# Define Pipeline
xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # use_label_encoder is ignored by this XGBoost version (see the warning in the cell output)
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=51))
])

# Grid Search CV - Tuning:
xgb_grid = GridSearchCV(
    estimator=xgb_pipeline,
    param_grid=xgb_params,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

# Start the timer (the elapsed time reported below covers tuning, prediction, and metric computation)
start_time = time.time()

# Train the model
xgb_grid.fit(X_train, y_train)

# Predict on the test set
y_pred = xgb_grid.predict(X_test)

# XGBoost Metrics
metrics = {
    'Model': 'XGBoost',
    'Best Parameters': xgb_grid.best_params_,
    'Train Accuracy': xgb_grid.best_score_,
    'Test Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1 Score': f1_score(y_test, y_pred),
    'ROC AUC': roc_auc_score(y_test, xgb_grid.predict_proba(X_test)[:, 1]),
    'Training Time (s)': time.time() - start_time
}

# Convert metrics to DataFrame
xgb_results_df = pd.DataFrame([metrics])

xgb_results_df


Parameters: { "use_label_encoder" } are not used.



Unnamed: 0,Model,Best Parameters,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC AUC,Training Time (s)
0,XGBoost,"{'classifier__learning_rate': 0.1, 'classifier...",0.803651,0.807383,0.688488,0.531359,0.599803,0.857545,164.651905


### **KNN - k-nearest neighbors**

"Is a nonparametric supervised classifier, which uses proximity to make classifications or predictions about the clustering of an individual data point."

Source: https://www.ibm.com/br-pt/topics/knn#:~:text=O%20algoritmo%20k%2Dnearest%20neighbors%20(KNN)%2C%20ou%20k,um%20ponto%20de%20dados%20individual.

In [6]:
# Configure hyperparameters for KNN
knn_params = {
    'classifier__n_neighbors': [3, 5, 7],
    'classifier__weights': ['uniform', 'distance']
}

# Define Pipeline
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])

# Grid Search CV - Tuning:
knn_grid = GridSearchCV(
    estimator=knn_pipeline,
    param_grid=knn_params,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

# Start the timer (the elapsed time reported below covers tuning, prediction, and metric computation)
start_time = time.time()

# Train the model
knn_grid.fit(X_train, y_train)

# Predict on the test set
y_pred = knn_grid.predict(X_test)

# KNN Metrics
metrics = {
    'Model': 'KNN',
    'Best Parameters': knn_grid.best_params_,
    'Train Accuracy': knn_grid.best_score_,
    'Test Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1 Score': f1_score(y_test, y_pred),
    'ROC AUC': roc_auc_score(y_test, knn_grid.predict_proba(X_test)[:, 1]),
    'Training Time (s)': time.time() - start_time
}

# Convert metrics to DataFrame
knn_results_df = pd.DataFrame([metrics])

knn_results_df


Unnamed: 0,Model,Best Parameters,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC AUC,Training Time (s)
0,KNN,"{'classifier__n_neighbors': 7, 'classifier__we...",0.760446,0.765263,0.571168,0.545296,0.557932,0.798259,19.408799


### **SVM - Support Vector Machines**

"Classifies data by finding an optimal line or hyperplane that maximizes the distance between each class in an n-dimensional space."

Source: https://www.ibm.com/br-pt/topics/support-vector-machine

In [None]:
# Configure hyperparameters for SVM
svm_params = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf']
}

# Define Pipeline
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(probability=True, random_state=51))
])

# Grid Search CV - Tuning:
svm_grid = GridSearchCV(
    estimator=svm_pipeline,
    param_grid=svm_params,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

# Start the timer (the elapsed time reported below covers tuning, prediction, and metric computation)
start_time = time.time()

# Train the model
svm_grid.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_grid.predict(X_test)

# SVM Metrics
metrics = {
    'Model': 'SVM',
    'Best Parameters': svm_grid.best_params_,
    'Train Accuracy': svm_grid.best_score_,
    'Test Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1 Score': f1_score(y_test, y_pred),
    'ROC AUC': roc_auc_score(y_test, svm_grid.predict_proba(X_test)[:, 1]),
    'Training Time (s)': time.time() - start_time
}

# Convert metrics to DataFrame
svm_results_df = pd.DataFrame([metrics])

svm_results_df


Unnamed: 0,Model,Best Parameters,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC AUC,Training Time (s)
0,SVM,"{'classifier__C': 1, 'classifier__kernel': 'li...",0.801217,0.808803,0.679325,0.560976,0.614504,0.856184,128.808987


### **Naive Bayes**

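A probabilistic classifier based on Bayes' theorem that assumes the features are conditionally independent given the class. `GaussianNB` additionally models each feature with a normal distribution.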

In [9]:
# Rebuild the preprocessor so OneHotEncoder returns a dense array (GaussianNB does not accept sparse input)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numerical_features),  # Scaling for numerical features
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)  # Dense output for categorical features
    ])

# Define Pipeline
nb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GaussianNB())
])

# Start the timer (the elapsed time reported below covers fitting, prediction, and metric computation)
start_time = time.time()

# Train the model
nb_pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = nb_pipeline.predict(X_test)

# Naive Bayes Metrics
metrics = {
    'Model': 'Naive Bayes',
    'Best Parameters': 'N/A',  # No hyperparameters tuned for GaussianNB
    'Train Accuracy': nb_pipeline.score(X_train, y_train),
    'Test Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1 Score': f1_score(y_test, y_pred),
    'ROC AUC': roc_auc_score(y_test, nb_pipeline.predict_proba(X_test)[:, 1]),
    'Training Time (s)': time.time() - start_time
}

# Convert metrics to DataFrame
nb_results_df = pd.DataFrame([metrics])

nb_results_df


Unnamed: 0,Model,Best Parameters,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC AUC,Training Time (s)
0,Naive Bayes,,1.0,0.271652,0.271652,1.0,0.427242,0.5,3.07713
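
The Naive Bayes row above is degenerate: a train accuracy of 1.0 combined with a test precision equal to the test accuracy (0.271652), a Recall of 1.0 and a ROC AUC of 0.5 is what results when every test customer is predicted as churn. A likely cause, stated here as an assumption rather than something verified on the original data, is that `customerID` stays in `X` and is one-hot encoded into one column per training customer, which lets the model memorize the training set. A minimal sketch of excluding the identifier before building the features, on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical sample with column names taken from the dataset
sample = pd.DataFrame({
    'customerID': ['7590-VHVEG', '5575-GNVDE', '3668-QPYBK'],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
    'tenure': [1, 34, 2],
    'Churn': ['No', 'No', 'Yes'],
})

# Drop the target and the identifier, which is unique per customer
X_sample = sample.drop(columns=['Churn', 'customerID'])
y_sample = (sample['Churn'] == 'Yes').astype(int)

print(list(X_sample.columns))
print(y_sample.tolist())
```

Applying the same idea to the full data would change every model's results, so the metrics reported in this notebook would need to be recomputed.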


In [10]:
all_results_df = pd.concat([rf_results_df, xgb_results_df, knn_results_df, nb_results_df, svm_results_df])


all_results_df

Unnamed: 0,Model,Best Parameters,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC AUC,Training Time (s)
0,Random Forest,"{'classifier__max_depth': None, 'classifier__m...",0.790061,0.790819,0.684358,0.426829,0.525751,0.842021,153.575791
0,XGBoost,"{'classifier__learning_rate': 0.1, 'classifier...",0.803651,0.807383,0.688488,0.531359,0.599803,0.857545,164.651905
0,KNN,"{'classifier__n_neighbors': 7, 'classifier__we...",0.760446,0.765263,0.571168,0.545296,0.557932,0.798259,19.408799
0,Naive Bayes,,1.0,0.271652,0.271652,1.0,0.427242,0.5,3.07713
0,SVM,"{'classifier__C': 1, 'classifier__kernel': 'li...",0.801217,0.808803,0.679325,0.560976,0.614504,0.856184,128.808987


The goal is to identify customers who are likely to churn so that they can be targeted with retention actions. Missing a churner (a false negative) has a high cost (loss of revenue, loss of market share, etc.), while a false positive only adds some extra retention expense, so **the priority metric should be Recall.**
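
Because Recall is the priority, one option that goes beyond what this notebook does is to lower the decision threshold applied to `predict_proba`, trading some precision for Recall. The sketch below illustrates this on synthetic data (not the Telco dataset); the 0.3 cut-off and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 27% positives)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
churn_proba = model.predict_proba(X_te)[:, 1]

# Compare the default 0.5 cut-off with a lower, hypothetical 0.3 cut-off
results = {}
for threshold in (0.5, 0.3):
    y_hat = (churn_proba >= threshold).astype(int)
    results[threshold] = {
        'precision': precision_score(y_te, y_hat, zero_division=0),
        'recall': recall_score(y_te, y_hat),
    }
    print(threshold, results[threshold])
```

Lowering the threshold can only keep or increase the number of customers flagged as churn, so Recall never decreases, while precision usually drops. Another option along the same lines is passing `scoring='recall'` to `GridSearchCV` instead of `scoring='accuracy'`.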

**Model definition:**

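Naive Bayes reaches a Recall of 1.0 only because it predicts churn for every test customer (its test accuracy is 0.27 and its ROC AUC is 0.5), so it is not a usable candidate. Among the remaining models, the one with the highest Recall (0.561), the highest test accuracy (0.809) and the highest F1 Score (0.615), with a ROC AUC (0.856) close to that of XGBoost (0.858), **is the SVM model.**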

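P.S.: Even though the database has few records (but many columns), the SVM run took about 129 s, compared with about 19 s for KNN (Random Forest and XGBoost took about 154 s and 165 s). SVM training time grows quickly with the number of records, so for a database with a high volume of data, consider the KNN model, which presented reasonable metrics (Recall 0.545) with a much shorter run time.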

In [15]:
print('Best parameters for the SVM model:')
svm_grid.best_params_

Best parameters for the SVM model:


{'classifier__C': 1, 'classifier__kernel': 'linear'}