# 01 - From Business Story to Data
## Understanding the Problem Before Writing Any Code

In this notebook, we will **not train a model yet**.

Our goal is to understand:
- the business problem (customer churn)
- how this problem becomes a machine learning task
- how data represents past examples
- which columns are the **inputs** and which one is the **output**


## The Business Problem: Customer Churn

A company provides a service to customers.

Some customers **stay**.  
Some customers **leave** (this is called **churn**).

### Business question:
> Can we predict if a customer is likely to leave?

Why this matters:
- Keeping customers is cheaper than finding new ones
- If we can predict churn early, the business can take action


## Load the Data

In [26]:
import pandas as pd

# Load the churn dataset
df = pd.read_csv("data/churn.csv")

# Look at the first few rows
df.head()

| | gender | SeniorCitizen | Partner | tenure | PhoneService | MultipleLines | InternetService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | 1 | No | No phone service | DSL | Month-to-month | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | 34 | Yes | No | DSL | One year | Mailed check | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | 2 | Yes | No | DSL | Month-to-month | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | 45 | No | No phone service | DSL | One year | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | 2 | Yes | No | Fiber optic | Month-to-month | Electronic check | 70.7 | 151.65 | Yes |


## Check the Data


In [27]:
# Shape of the dataset
df.shape

(7043, 12)

In [28]:
# Check data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   gender           7043 non-null   object 
 1   SeniorCitizen    7043 non-null   int64  
 2   Partner          7043 non-null   object 
 3   tenure           7043 non-null   int64  
 4   PhoneService     7043 non-null   object 
 5   MultipleLines    7043 non-null   object 
 6   InternetService  7043 non-null   object 
 7   Contract         7043 non-null   object 
 8   PaymentMethod    7043 non-null   object 
 9   MonthlyCharges   7043 non-null   float64
 10  TotalCharges     7043 non-null   object 
 11  Churn            7043 non-null   object 
dtypes: float64(1), int64(2), object(9)
memory usage: 660.4+ KB


> The `TotalCharges` column is stored as `object` (text) instead of a numeric type, so we will need to clean it.
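
Before cleaning, it can help to see which entries prevent the column from being numeric. The sketch below is only an illustration on a small made-up frame (`toy` and its values are assumptions, not the real dataset): it attempts the conversion on a copy and lists the raw values that fail to parse.

```python
import pandas as pd

# Toy stand-in for the real data; these values are made up for illustration.
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15", ""]})

# Try the conversion without modifying the original column.
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Entries that were present as text but could not be parsed as numbers.
bad_mask = converted.isna() & toy["TotalCharges"].notna()
bad_values = toy.loc[bad_mask, "TotalCharges"]

# repr() makes blank and whitespace-only strings visible when printed.
print(bad_values.map(repr).value_counts())
```

Running the same pattern on `df["TotalCharges"]` would show whether the problem is blank strings, stray symbols, or something else.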

In [29]:
# Check missing values
df.isnull().sum()

gender             0
SeniorCitizen      0
Partner            0
tenure             0
PhoneService       0
MultipleLines      0
InternetService    0
Contract           0
PaymentMethod      0
MonthlyCharges     0
TotalCharges       0
Churn              0
dtype: int64

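> `isnull()` reports no missing values. Keep in mind that it only counts `NaN`; blank strings in a text column such as `TotalCharges` are not detected here.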

In [30]:
# Check duplicates
df.duplicated().sum()

np.int64(42)

> There are 42 duplicate rows in the dataset. We will remove them.
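
Since the columns shown here do not include a customer identifier, two identical rows could in principle belong to two different customers, so it may be worth looking at the duplicates before dropping them. The sketch below uses a small made-up frame (`toy` is an assumption, not the real dataset) to show the difference between counting the extra copies and viewing every row in a duplicated group.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are made up for illustration.
toy = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Male"],
        "tenure": [1, 5, 1, 1],
        "MonthlyCharges": [20.0, 70.5, 20.0, 20.0],
    }
)

# Default behaviour: only the second and later copies are flagged.
n_extra_copies = toy.duplicated().sum()

# keep=False flags every member of a duplicated group, including the first.
all_copies = toy[toy.duplicated(keep=False)]

print(n_extra_copies)
print(all_copies)
```

`drop_duplicates()` removes exactly the rows counted by the default `duplicated()` call and keeps the first copy of each group.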

In [31]:
# Drop duplicate rows, then confirm none remain
df = df.drop_duplicates()
df.duplicated().sum()

np.int64(0)

> Now, the dataset has no duplicates.

In [None]:
# Summary statistics for numerical columns
df.describe(include='number')

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7001.0 | 7001.0 | 7001.0 |
| mean | 0.162977 | 32.559349 | 64.962377 |
| std | 0.369371 | 24.512177 | 30.032081 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.9 |
| 50% | 0.0 | 29.0 | 70.45 |
| 75% | 0.0 | 56.0 | 89.9 |
| max | 1.0 | 72.0 | 118.75 |


In [None]:
# Summary statistics for categorical columns
df.describe(exclude='number')

| | gender | Partner | PhoneService | MultipleLines | InternetService | Contract | PaymentMethod | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|
| count | 7001 | 7001 | 7001 | 7001 | 7001 | 7001 | 7001 | 7001 | 7001 |
| unique | 2 | 2 | 2 | 3 | 3 | 3 | 4 | 6531 | 2 |
| top | Male | No | Yes | No | Fiber optic | Month-to-month | Electronic check | | No |
| freq | 3526 | 3600 | 6319 | 3348 | 3089 | 3833 | 2357 | 11 | 5151 |


In [32]:
# Clean TotalCharges column
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')  # Any non-numeric values will be set to NaN

In [33]:
# Check the missing values again
df.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
Contract            0
PaymentMethod       0
MonthlyCharges      0
TotalCharges       11
Churn               0
dtype: int64

> After the conversion, the `TotalCharges` column has 11 missing values. We will drop those rows.
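
It is usually worth looking at the affected rows before removing them. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions); it also uses `dropna(subset=...)`, a variation that only considers `TotalCharges` when deciding which rows to drop.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; these rows are made up for illustration.
toy = pd.DataFrame(
    {
        "tenure": [0, 12, 3],
        "MonthlyCharges": [52.55, 20.25, 80.85],
        "TotalCharges": [np.nan, 243.0, 242.55],
    }
)

# Look at the affected rows before removing them.
missing_rows = toy[toy["TotalCharges"].isna()]
print(missing_rows)

# Drop only rows where TotalCharges is missing, returning a new DataFrame.
cleaned = toy.dropna(subset=["TotalCharges"])
print(len(toy), "->", len(cleaned))
```

In this notebook `TotalCharges` is the only column with missing values, so a plain `dropna()` removes the same rows.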

In [None]:
# Drop rows with missing values
df.dropna(inplace=True)

In [None]:
# Check the missing values again
df.isnull().sum()

gender             0
SeniorCitizen      0
Partner            0
tenure             0
PhoneService       0
MultipleLines      0
InternetService    0
Contract           0
PaymentMethod      0
MonthlyCharges     0
TotalCharges       0
Churn              0
dtype: int64
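
With the data cleaned, the split between inputs and output mentioned at the start of the notebook can be made concrete. The sketch below is an illustration only: `toy` reuses a few values from the first rows shown earlier, the names `X` and `y` follow common convention, and the 1/0 encoding of `Churn` is an assumption, not a step this notebook performs.

```python
import pandas as pd

# Small stand-in frame with a few of the notebook's columns.
toy = pd.DataFrame(
    {
        "tenure": [1, 34, 2],
        "Contract": ["Month-to-month", "One year", "Month-to-month"],
        "MonthlyCharges": [29.85, 56.95, 53.85],
        "Churn": ["No", "No", "Yes"],
    }
)

# Inputs (X): everything we know about the customer except the outcome.
X = toy.drop(columns=["Churn"])

# Output (y): the outcome to predict, encoded as 1 = churned, 0 = stayed.
y = toy["Churn"].map({"Yes": 1, "No": 0})

print(X.shape, y.shape)
print("Share of customers who churned:", y.mean())
```

Each row of `X` describes one past customer, and the matching entry of `y` records whether that customer left.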

## Save Cleaned Data

In [None]:
# Save Cleaned Data
df.to_csv("data/churn_cleaned.csv", index=False)
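
A quick way to confirm the save worked is to read the file back and compare. This is a minimal sketch using a temporary directory and a made-up frame (`toy` and the temporary path are assumptions), so it does not touch the real `data/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned data; values are made up for illustration.
toy = pd.DataFrame({"tenure": [1, 34], "TotalCharges": [29.85, 1889.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "churn_cleaned.csv"  # hypothetical location
    toy.to_csv(path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and a numeric TotalCharges.
print(reloaded.shape == toy.shape)
print(pd.api.types.is_numeric_dtype(reloaded["TotalCharges"]))
```

`index=False` keeps the row index out of the file; without it, reading the file back would produce an extra `Unnamed: 0` column.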

## ✅ What We Learned in This Notebook

- How a business problem becomes a machine learning task
- Data represents **past examples**
- Each row = one customer
- Inputs (X) are the columns that describe a customer; the output (y) is the `Churn` column

🚫 We did NOT train a model yet  
➡️ That comes next
