# 01 - Load and Recognize Data

In this notebook, we will:
- Load the dataset
- Clean the dataset
- Explore the dataset


## Load the Data

In [None]:
# We will use pandas to load the dataset, which is in CSV format
import pandas as pd

# Load the churn dataset
df = pd.read_csv("data/churn.csv")

# Look at the first few rows
df.head()

| | gender | SeniorCitizen | Partner | tenure | PhoneService | MultipleLines | InternetService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | 1 | No | No phone service | DSL | Month-to-month | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | 34 | Yes | No | DSL | One year | Mailed check | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | 2 | Yes | No | DSL | Month-to-month | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | 45 | No | No phone service | DSL | One year | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | 2 | Yes | No | Fiber optic | Month-to-month | Electronic check | 70.7 | 151.65 | Yes |


## 📊 Dataset Overview


This dataset contains customer information related to **subscription services and churn behavior**.  

### 🧑‍🤝‍🧑 Demographic Information
- **gender**: Customer gender (`Male`, `Female`)
- **SeniorCitizen**: Whether the customer is a senior citizen (`1` = Yes, `0` = No)
- **Partner**: Whether the customer has a partner (`Yes` / `No`)

### 📅 Account & Contract Information
- **tenure**: Number of months the customer has stayed with the company
- **Contract**: Type of contract (`Month-to-month`, `One year`, `Two year`)
- **PaymentMethod**: Payment method used by the customer

### 📞 Services Subscribed
- **PhoneService**: Whether the customer has phone service
- **MultipleLines**: Whether the customer has multiple phone lines
- **InternetService**: Type of internet service (`DSL`, `Fiber optic`, or `No`)

### 💰 Billing & Charges
- **MonthlyCharges**: Monthly amount charged to the customer
- **TotalCharges**: Total amount charged to the customer over the customer's lifetime

### 🔁 Churn Information
- **Churn**: Indicates whether the customer left the service (`Yes` / `No`)
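The value lists in the column descriptions above can double as a simple sanity check. The following is a minimal sketch, not part of the original notebook: the helper `find_unexpected_values`, the `expected_values` mapping, and the toy rows are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy rows standing in for the real data (illustrative values only)
sample = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Two years"],  # last value is a deliberate typo
    "Churn": ["No", "No", "Yes"],
})

# Allowed values, copied from the column descriptions above
expected_values = {
    "gender": {"Male", "Female"},
    "Contract": {"Month-to-month", "One year", "Two year"},
    "Churn": {"Yes", "No"},
}

def find_unexpected_values(frame, expected):
    """Return {column: sorted unexpected values} for columns that deviate."""
    problems = {}
    for column, allowed in expected.items():
        unexpected = set(frame[column].dropna().unique()) - allowed
        if unexpected:
            problems[column] = sorted(unexpected)
    return problems

unexpected = find_unexpected_values(sample, expected_values)
print(unexpected)
```

On the real data, an empty result would suggest the categorical columns match their descriptions.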

## Check the Data


In [2]:
# Shape of the dataset
df.shape

(7043, 12)

In [3]:
# Check data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   gender           7043 non-null   object 
 1   SeniorCitizen    7043 non-null   int64  
 2   Partner          7043 non-null   object 
 3   tenure           7043 non-null   int64  
 4   PhoneService     7043 non-null   object 
 5   MultipleLines    7043 non-null   object 
 6   InternetService  7043 non-null   object 
 7   Contract         7043 non-null   object 
 8   PaymentMethod    7043 non-null   object 
 9   MonthlyCharges   7043 non-null   float64
 10  TotalCharges     7043 non-null   object 
 11  Churn            7043 non-null   object 
dtypes: float64(1), int64(2), object(9)
memory usage: 660.4+ KB


> The `TotalCharges` column has dtype `object` instead of a numeric dtype, so it will need cleaning.
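Before converting a column like this, it can help to see which entries prevent a numeric parse. The sketch below uses a small made-up series in place of `df['TotalCharges']`; the variable names and sample values are assumptions for illustration.

```python
import pandas as pd

# Toy values standing in for df["TotalCharges"] (illustrative only)
charges = pd.Series(["29.85", "1889.5", " ", "108.15", "n/a"], name="TotalCharges")

# Coerce to numbers; anything that cannot be parsed becomes NaN
parsed = pd.to_numeric(charges, errors="coerce")

# Keep the original strings that failed to parse (and were not already null)
unparseable = charges[parsed.isna() & charges.notna()]
print(unparseable.tolist())
```

Inspecting the offending strings first makes it easier to decide whether to drop, fix, or impute them.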

In [4]:
# Check missing values
df.isnull().sum()

gender             0
SeniorCitizen      0
Partner            0
tenure             0
PhoneService       0
MultipleLines      0
InternetService    0
Contract           0
PaymentMethod      0
MonthlyCharges     0
TotalCharges       0
Churn              0
dtype: int64

> `isnull()` reports no missing values in any column.
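Note that `isnull()` only flags true nulls (`NaN` or `None`). Blank or whitespace-only strings in text columns pass this check unnoticed. A minimal sketch of a complementary check, using a hypothetical toy frame rather than the real dataset:

```python
import pandas as pd

# Toy frame standing in for df (illustrative values only)
sample = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "TotalCharges": ["29.85", " ", "108.15"],
    "tenure": [1, 0, 2],
})

# Count entries that are empty after stripping whitespace, per text column
blank_counts = {
    column: int(sample[column].str.strip().eq("").sum())
    for column in sample.columns
    if pd.api.types.is_string_dtype(sample[column])
}
print(blank_counts)
```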

In [5]:
# Check duplicates
df.duplicated().sum()

np.int64(42)

> There are 42 duplicate rows in the dataset. We will remove them.
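The columns shown here include no customer identifier, so identical rows could in principle be distinct customers who happen to share every attribute. It may be worth looking at the duplicates before dropping them. A small sketch on toy data (the frame below is hypothetical):

```python
import pandas as pd

# Toy frame standing in for df (illustrative values only)
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Male"],
    "tenure": [2, 10, 2, 5],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# keep=False marks every member of a duplicate group, not only the later copies
all_copies = sample[sample.duplicated(keep=False)]

# The default (keep="first") counts only the rows that drop_duplicates() would remove
later_copies = int(sample.duplicated().sum())

print(all_copies)
print(later_copies)
```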

In [6]:
# Drop duplicates, then confirm that none remain
df = df.drop_duplicates()
df.duplicated().sum()

np.int64(0)

> Now, the dataset has no duplicates.

In [7]:
# Check summary statistics for numerical columns
df.describe(include='number')

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7001.0 | 7001.0 | 7001.0 |
| mean | 0.162977 | 32.559349 | 64.962377 |
| std | 0.369371 | 24.512177 | 30.032081 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.9 |
| 50% | 0.0 | 29.0 | 70.45 |
| 75% | 0.0 | 56.0 | 89.9 |
| max | 1.0 | 72.0 | 118.75 |


In [8]:
# Check summary statistics for categorical columns
df.describe(exclude='number')

| | gender | Partner | PhoneService | MultipleLines | InternetService | Contract | PaymentMethod | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|
| count | 7001 | 7001 | 7001 | 7001 | 7001 | 7001 | 7001 | 7001 | 7001 |
| unique | 2 | 2 | 2 | 3 | 3 | 3 | 4 | 6531 | 2 |
| top | Male | No | Yes | No | Fiber optic | Month-to-month | Electronic check | (blank) | No |
| freq | 3526 | 3600 | 6319 | 3348 | 3089 | 3833 | 2357 | 11 | 5151 |
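The `top` and `freq` rows give only the most common value in each column. For the target column, the full class balance is often more useful. The sketch below first turns the reported `Churn` counts (5151 of 7001 rows are `No`) into a share, then shows `value_counts(normalize=True)` on a toy series standing in for `df['Churn']`; the toy values are made up.

```python
import pandas as pd

# Share of the most frequent Churn value, from the counts in the table above
share_no_reported = round(5151 / 7001, 3)

# Toy series standing in for df["Churn"] (illustrative values only)
churn = pd.Series(["No", "No", "Yes", "No"], name="Churn")

# normalize=True returns proportions instead of raw counts
shares = churn.value_counts(normalize=True)

print(share_no_reported)
print(shares)
```

Roughly 74% of the deduplicated rows are non-churners, so the classes are imbalanced.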


In [9]:
# Clean TotalCharges column
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')  # Any non-numeric values will be set to NaN

In [10]:
# Check the missing values again
df.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
Contract            0
PaymentMethod       0
MonthlyCharges      0
TotalCharges       11
Churn               0
dtype: int64

> After the conversion, the `TotalCharges` column has 11 missing values. We will drop those rows.
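It can be useful to look at the affected rows before removing them, and to limit the drop to the column in question with the `subset` parameter of `dropna`. A minimal sketch on a hypothetical toy frame (the values are made up, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for df (illustrative values only)
sample = pd.DataFrame({
    "tenure": [0, 12, 3],
    "MonthlyCharges": [52.55, 20.0, 80.85],
    "TotalCharges": [" ", "240.0", "242.55"],
})
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")

# Inspect the rows that became NaN after conversion
missing_rows = sample[sample["TotalCharges"].isna()]
print(missing_rows)

# Drop only rows where TotalCharges is missing; other columns are not considered
cleaned = sample.dropna(subset=["TotalCharges"])
print(cleaned.dtypes)
```

Here only `TotalCharges` has gaps, so this gives the same result as a plain `dropna()`. With `subset`, gaps in other columns would not cause rows to be dropped.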

In [11]:
# Drop rows with missing values
df.dropna(inplace=True)

In [12]:
# Check the missing values again
df.isnull().sum()

gender             0
SeniorCitizen      0
Partner            0
tenure             0
PhoneService       0
MultipleLines      0
InternetService    0
Contract           0
PaymentMethod      0
MonthlyCharges     0
TotalCharges       0
Churn              0
dtype: int64

## Save Cleaned Data

In [None]:
from pathlib import Path

# Create data directory if it doesn't exist
Path("data").mkdir(exist_ok=True)

# Save the cleaned data without the index column
df.to_csv("data/churn_cleaned.csv", index=False)
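A quick round-trip check can confirm that the saved file reloads with the expected shape and dtypes. The sketch below writes a small hypothetical frame to a temporary directory instead of `data/`, so it can run without the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned df (illustrative values only)
cleaned = pd.DataFrame({
    "tenure": [1, 34],
    "TotalCharges": [29.85, 1889.5],
    "Churn": ["No", "No"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "churn_cleaned.csv"
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same shape and columns, and TotalCharges is read back as a float column
round_trip_ok = (
    reloaded.shape == cleaned.shape
    and list(reloaded.columns) == list(cleaned.columns)
    and pd.api.types.is_float_dtype(reloaded["TotalCharges"])
)
print(round_trip_ok)
```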

## Data Profiling

> This section is optional. You can skip it if you want.

> A profiling report can help you understand the data better and surface additional insights.

In [None]:
# We will use ydata-profiling to generate a profiling report
from ydata_profiling import ProfileReport

# Generate a profiling report
profile = ProfileReport(df, title="Churn Data Profiling Report", progress_bar=False)

In [None]:
# Create reports directory if it doesn't exist
Path("reports").mkdir(exist_ok=True)

# Save the report as an HTML file (You can open this file in any browser)
profile.to_file("reports/churn_data_profiling_report.html")

100%|██████████| 12/12 [00:00<00:00, 101.17it/s]


## ✅ What We Learned in This Notebook

- How to load data using pandas
- How to clean data
- How to save cleaned data
- How to generate a data profiling report


➡️ Next: Go to `02_train_and_evaluate.ipynb` to train and evaluate a machine learning model.

