# __XGBoost Tuning Lab - EDA__


In this notebook we __explore the data used in this demonstration and walk through a step-by-step analysis that informs the data preprocessing decisions__.

The __functions in this notebook are meant to be reusable__, so you can run them on your own data!

To solve any machine learning problem, first understand your data, assess the problem, and then determine the best way to solve it.

---


## <font color='#696969'> 1. Dependencies & Settings</font>

In this section, we import the necessary libraries and configure the environment to ensure a streamlined workflow. This includes setting up paths, loading packages and user-defined functions, and defining settings that will be used throughout the project.

In [1]:
# Magic commands
%autosave 90

# Import packages
import os
import sys
import warnings
import numpy as np
import pandas as pd
# import seaborn as sns

# Add the src directory to the sys path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

# Import user defined modules
import data_loader

# Ignore warnings
warnings.filterwarnings('ignore')

# Set visual theme
# sns.set_theme(style = 'white', palette = None)

# Set up the path to the metadata file created by the data ingestor
metadata_file_path = os.path.abspath(os.path.join('..', 'metadata', 'metadata.json'))

Autosaving every 90 seconds


## <font color='#696969'> 2. Data Collection & Understanding</font>

In this section, we load the extracted data automatically using the metadata generated by the data ingestor. We then explore its structure to identify initial patterns, relationships, and potential errors, and relate them to the problem at hand.

In [2]:
# Initialize DataLoader
data_loader_instance = data_loader.DataLoader(metadata_file_path)

# Load data using metadata created by the data ingestor
df = data_loader_instance.load_data()

Loading file...


Loaded WA_Fn-UseC_-Telco-Customer-Churn with 7043 rows and 21 features.



| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
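
The loading step relies on `data_loader.DataLoader`, a project module whose internals are not shown in this notebook. For readers who want to reuse the idea, here is a minimal sketch of what a metadata-driven loader could look like. The class name `MetadataDataLoader`, the `file_path` metadata key, and the CSV format are all assumptions made for illustration; the real module may work differently.

```python
import json
import os
import tempfile

import pandas as pd


class MetadataDataLoader:
    """Hypothetical stand-in for data_loader.DataLoader (the real class may differ)."""

    def __init__(self, metadata_file_path):
        self.metadata_file_path = metadata_file_path

    def load_data(self):
        # Read the metadata assumed to be written by the ingestion step
        with open(self.metadata_file_path, "r", encoding="utf-8") as f:
            metadata = json.load(f)

        # 'file_path' is an assumed key name, not taken from the project
        file_path = metadata["file_path"]
        df = pd.read_csv(file_path)

        # Report the dataset name and shape, similar to the message above
        name = os.path.splitext(os.path.basename(file_path))[0]
        print(f"Loaded {name} with {df.shape[0]} rows and {df.shape[1]} features.")
        return df


# Usage with a throwaway CSV and metadata file
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample.csv")
pd.DataFrame(
    {"customerID": ["a", "b"], "Churn": ["No", "Yes"]}
).to_csv(csv_path, index=False)

meta_path = os.path.join(tmp_dir, "metadata.json")
with open(meta_path, "w", encoding="utf-8") as f:
    json.dump({"file_path": csv_path}, f)

sample_df = MetadataDataLoader(meta_path).load_data()
```

Keeping the file location in a metadata file rather than in the notebook means the same notebook can be pointed at a different dataset without code changes.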


We are working with a telecom dataset where __each row represents a customer, each feature captures a specific attribute of that customer, and the target variable, `Churn`, flags clients who left within the last month__. Customer churn refers to customers who stop using a company's products or services.

The goal is to __build a model to predict customer churn__, enabling the company to create strategies for retaining customers before they switch to a competitor. In the telecom industry, customers frequently switch providers, often influenced by more attractive offers.

Personalized retention for all customers is costly and inefficient. However, __accurately predicting high-risk customers allows companies to focus retention efforts strategically, maximizing efficiency and minimizing losses__.
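
Since `Churn` is the target, a natural first check is how the two classes are distributed and how the churn rate varies across a feature. The sketch below illustrates this on the five preview rows only, so the numbers it prints are not representative of the full dataset. The column name `churn_flag` is introduced here for illustration.

```python
import pandas as pd

# Five rows copied from the preview; the full DataFrame has 7043 rows
sample = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW", "9237-HQITU"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

# Share of each class in the target
churn_share = sample["Churn"].value_counts(normalize=True)
print(churn_share)

# Encode the target as 0/1 so that a mean can be read as a churn rate
sample["churn_flag"] = sample["Churn"].map({"No": 0, "Yes": 1})

# Churn rate per contract type
churn_rate_by_contract = sample.groupby("Contract")["churn_flag"].mean()
print(churn_rate_by_contract)
```

To run the same check on the real data, replace `sample` with the `df` loaded earlier.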

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


We can immediately observe that there a