# **XGBoost Tuning Lab - EDA**


In this notebook we **explore the data used in this demonstration and work through a step-by-step analysis to decide how it should be preprocessed**.

The __functions in this code are meant to be reproducible__, so you can run them on your own data!

To solve any machine learning problem, you should first understand your data, assess the problem, and determine the best ways to solve it.

---



## <font color='SeaGreen'> 1. Dependencies & Settings</font>

In this section, we import the necessary libraries and configure the environment to ensure a streamlined workflow. This includes setting up paths, loading packages and user-defined functions, and defining settings that will be used throughout the project.

In [3]:
# Magic commands
%autosave 90
%matplotlib inline

# Import packages
import os
import json
import warnings
import numpy as np
import pandas as pd
import seaborn as sns

# Ignore warnings
warnings.filterwarnings('ignore')

# Set visual theme
sns.set_theme(style = 'white', palette = None)

# Set up directories
current_folder = os.getcwd()
notebook_folder = os.path.dirname(current_folder)
main_folder = os.path.dirname(notebook_folder)
metadata_file_path = os.path.join(main_folder, 'metadata', 'metadata.json')
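
The directory setup above climbs two levels from the working directory and then descends into `metadata/metadata.json`. As a minimal sketch, the same resolution can be written with `pathlib`. The helper name `resolve_metadata_path` and the example folder `/project/notebooks/eda` are hypothetical and used only for illustration; the notebook itself uses `os.path`.

```python
from pathlib import Path


def resolve_metadata_path(start: Path) -> Path:
    """Return <grandparent of start>/metadata/metadata.json."""
    # parents[0] is the parent folder, parents[1] is the grandparent,
    # which mirrors calling os.path.dirname twice on the starting folder.
    main_folder = start.parents[1]
    return main_folder / 'metadata' / 'metadata.json'


# Hypothetical layout: the notebook runs from /project/notebooks/eda
example_path = resolve_metadata_path(Path('/project/notebooks/eda'))
print(example_path)
```

In the notebook, passing `Path.cwd()` would reproduce `metadata_file_path`, and calling `.is_file()` on the result is a cheap way to fail early if the notebook is launched from an unexpected folder.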

## <font color='SeaGreen'> 2. Data Collection & Understanding</font>

In this section, we automatically load the extracted data using the data ingestor, then explore its structure to identify initial patterns, relationships, and potential errors, analyzing their connection to the problem at hand.

In [38]:
# Load the metadata
with open(metadata_file_path, 'r') as metadata_file:
    metadata = json.load(metadata_file)

# Get the number of extracted files for flow control
if len(metadata) == 1:
    print('Loading file...\n\n')
    singular_file = True
else:
    print('Loading files...\n\n')
    singular_file = False
    dfs = []

# Iterate over metadata and load the extracted data
for entry in metadata:
    try:
        # Extract file name, file name without extension, and import instructions
        file_name = entry.get('file_name')
        file_name_no_ext = os.path.splitext(file_name)[0]
        import_function = entry['import_instructions']['function']
        import_args = entry['import_instructions']['arguments']

        # Use eval to dynamically call the importing pandas function
        df = eval(import_function)(**import_args)

        # Report what was loaded, then preview the first rows
        print(f'Loaded \033[1m{file_name_no_ext}\033[0m with {len(df)} rows and {len(df.columns)} features.\n')
        if singular_file:
            display(df.head().style)
        else:
            display(df.head())
            dfs.append(df)

    except Exception as e:
        print(f'Failed to load data from {file_name}: {e}')

Loading file...


Loaded [1mWA_Fn-UseC_-Telco-Customer-Churn[0m with 7043 rows and 21 features.



Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
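
The loading cell above uses `eval` to turn the function name stored in the metadata into a callable, which executes whatever string the metadata file contains. A minimal sketch of an alternative is to look the name up in an explicit allow-list. This assumes the metadata stores names such as `'pd.read_csv'`; the `ALLOWED_READERS` mapping, the `load_entry` helper, and the in-memory sample entry are hypothetical and not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical allow-list mapping metadata strings to pandas readers.
# Assumes the metadata stores names in this 'pd.<reader>' form.
ALLOWED_READERS = {
    'pd.read_csv': pd.read_csv,
    'pd.read_excel': pd.read_excel,
    'pd.read_json': pd.read_json,
    'pd.read_parquet': pd.read_parquet,
}


def load_entry(entry):
    """Load one metadata entry without evaluating arbitrary strings."""
    instructions = entry['import_instructions']
    function_name = instructions['function']
    if function_name not in ALLOWED_READERS:
        raise ValueError(f'Unsupported import function: {function_name}')
    return ALLOWED_READERS[function_name](**instructions['arguments'])


# Illustrative entry backed by an in-memory CSV instead of a real file
sample_entry = {
    'file_name': 'sample.csv',
    'import_instructions': {
        'function': 'pd.read_csv',
        'arguments': {
            'filepath_or_buffer': io.StringIO(
                'customerID,tenure,Churn\n7590-VHVEG,1,No\n5575-GNVDE,34,No\n'
            )
        },
    },
}

sample_df = load_entry(sample_entry)
print(sample_df.shape)
```

The trade-off is that every new reader has to be registered explicitly, but an unexpected string in the metadata then raises a clear `ValueError` rather than being executed.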
