# English

## 1. Introduction

This project explores whether a machine learning model can predict customer churn from usage patterns and demographics, using the Telco Customer Churn dataset published on [Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data).
The primary goal is to build a model that can accurately classify customers as likely to churn or not, enabling the company to take proactive measures to retain valuable customers.
The workflow included the following key stages:

1. **Data Extraction from Kaggle using an API**:
    The `kagglehub` library was used to download the dataset programmatically, so that re-running the notebook can pick up changes to the data source.
2. **Data Cleaning and Preprocessing**:
    This stage involved handling missing values, encoding categorical variables, and normalizing numerical features to prepare the data for modeling.
3. **Exploratory Data Analysis (EDA)**:
    EDA was conducted to understand the distribution of features, identify patterns, and visualize relationships between variables.
4. **Feature Engineering**:
    New features were created based on domain knowledge and insights gained from EDA to enhance the model's predictive power.
5. **Model Selection and Training**:
    Various machine learning algorithms were evaluated, including Logistic Regression, Decision Trees and Random Forests. The models were trained using cross-validation to ensure robustness.
6. **Model Evaluation**:
    The models were assessed using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC to determine their performance.
7. **Hyperparameter Tuning**:
    Grid Search and Random Search techniques were employed to optimize model parameters for better performance.
8. **Visualization of Results**:
    The results were visualized using confusion matrices, ROC curves, and feature importance plots to interpret the model's predictions.
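
The preprocessing, training, and evaluation stages listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's final model: the data is a synthetic stand-in with three columns named after the Telco dataset, the churn label is generated by a made-up rule, and the hyperparameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Telco data (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "MonthlyCharges": rng.uniform(18.0, 120.0, size=n).round(2),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})

# Made-up labelling rule: short-tenure month-to-month customers churn more often
high_risk = (demo["Contract"] == "Month-to-month") & (demo["tenure"] < 24)
churn_prob = np.where(high_risk, 0.7, 0.1)
demo["Churn"] = (rng.random(n) < churn_prob).astype(int)

numeric_cols = ["tenure", "MonthlyCharges"]
categorical_cols = ["Contract"]

# Scale numeric features and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Bundling preprocessing with the classifier keeps each CV fold free of leakage
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Stratified folds preserve the churn ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    model,
    demo[numeric_cols + categorical_cols],
    demo["Churn"],
    cv=cv,
    scoring="roc_auc",
)
print("ROC-AUC per fold:", scores.round(3))
```

Wrapping the scaler and encoder inside the `Pipeline` means they are refit on the training portion of each fold, so the validation portion never influences the preprocessing statistics.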

## 2. Project Development



This section details, step by step, the process followed to develop the analysis and the model, with code snippets, visualizations, and an explanation of each step.

### Importing Libraries

In [None]:
import pandas as pd
import os
import kagglehub
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sqlalchemy import create_engine
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib
import warnings
warnings.filterwarnings('ignore')


In [2]:
# The dataset is downloaded from Kaggle
path = kagglehub.dataset_download("blastchar/telco-customer-churn")

# Name of the downloaded file
file_name = "WA_Fn-UseC_-Telco-Customer-Churn.csv"

# The full file path is constructed
file_path = os.path.join(path, file_name)

# The file is loaded into a pandas DataFrame
telco_data = pd.read_csv(file_path)
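
The loading cell hard-codes the CSV file name. If the dataset is republished under a different name, one option is to search the download directory instead. The helper below is a hypothetical sketch (`find_single_csv` is not part of `kagglehub`), and it assumes the dataset ships exactly one CSV file.

```python
from pathlib import Path


def find_single_csv(directory):
    """Return the only CSV file under `directory`; raise if there is not exactly one."""
    matches = sorted(Path(directory).rglob("*.csv"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one CSV in {directory}, found {len(matches)}"
        )
    return matches[0]


# Possible usage with the download path from the cell above:
# file_path = find_single_csv(path)
# telco_data = pd.read_csv(file_path)
```

Failing loudly when zero or several CSV files are present avoids silently loading the wrong file.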

In [3]:
telco_data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [4]:
telco_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [6]:
telco_data.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |
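
The summary above covers only `SeniorCitizen`, `tenure`, and `MonthlyCharges`, although `head()` shows `TotalCharges` holding numeric-looking values. This suggests `TotalCharges` was read as text. The sketch below shows one way to convert such a column, using a hypothetical three-row sample in which a blank string stands in for whatever non-numeric entries the real column may contain.

```python
import pandas as pd

# Hypothetical sample; the blank string imitates a non-numeric entry
sample = pd.DataFrame({
    "tenure": [1, 34, 0],
    "TotalCharges": ["29.85", "1889.5", " "],
})

# errors="coerce" converts anything that cannot be parsed into NaN
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")

# Inspect the resulting dtype and how many values became missing
print(sample.dtypes)
print("Missing TotalCharges:", sample["TotalCharges"].isna().sum())
```

After the conversion, the coerced rows appear as missing values and can be handled in the cleaning stage (for example, dropped or imputed), and the column is included in `describe()`.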
