# Customer Lifecycle Analytics  
## Data Understanding & Initial Exploration

### Objective
The objective of this notebook is to understand the structure, quality,
and business meaning of the raw customer churn dataset before performing
any data cleaning or transformation.

### Key Questions
- What customer attributes are available in the dataset?
- Are there missing or inconsistent values?
- Which features are numerical vs categorical?
- Which column is the target variable that represents churn?

### Dataset Location
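The raw dataset used in this notebook is stored at:

`../data/raw/customer_churn_raw.csv` (path relative to this notebook, as read by the loading cell below)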


In [2]:
import pandas as pd
import numpy as np

# Load the raw churn dataset and preview the first five rows
df = pd.read_csv("../data/raw/customer_churn_raw.csv")
df.head()


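|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Complain | Satisfaction Score | Card Type | Point Earned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 2 | DIAMOND | 464 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 3 | DIAMOND | 456 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 3 | DIAMOND | 377 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 5 | GOLD | 350 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 5 | GOLD | 425 |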
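
One of the key questions is which features are numerical and which are categorical. A minimal sketch of one way to answer it from pandas dtypes is shown here. The helper `split_feature_types` and the `sample` frame are illustrative assumptions (the sample copies a few columns from the first three rows of the preview above), not part of the original notebook.

```python
import pandas as pd

# A few columns from the first three rows of the preview above (subset for brevity).
sample = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Female"],
    "Balance": [0.0, 83807.86, 159660.8],
    "HasCrCard": [1, 0, 1],
    "Exited": [1, 0, 1],
})


def split_feature_types(frame):
    """Return (numeric_columns, non_numeric_columns) based on pandas dtypes."""
    numeric = frame.select_dtypes(include="number").columns.tolist()
    non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
    return numeric, non_numeric


numeric_cols, categorical_cols = split_feature_types(sample)

# Numeric columns holding only 0/1 are stored as numbers but behave like
# categorical flags, so they are listed separately for manual review.
binary_flags = [c for c in numeric_cols if set(sample[c].unique()) <= {0, 1}]

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("binary flags:", binary_flags)
```

On the full dataset the same helper could be called as `split_feature_types(df)`. The 0/1 check is only a heuristic, and columns such as identifiers still need a manual look.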


In [3]:
# Normalize column names: trim whitespace, lowercase, replace spaces with underscores
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

# Inspect the normalized column names
df.columns



Index(['rownumber', 'customerid', 'surname', 'creditscore', 'geography',
       'gender', 'age', 'tenure', 'balance', 'numofproducts', 'hascrcard',
       'isactivemember', 'estimatedsalary', 'exited', 'complain',
       'satisfaction_score', 'card_type', 'point_earned'],
      dtype='object')


In [4]:

df.isnull().sum()


rownumber             0
customerid            0
surname               0
creditscore           0
geography             0
gender                0
age                   0
tenure                0
balance               0
numofproducts         0
hascrcard             0
isactivemember        0
estimatedsalary       0
exited                0
complain              0
satisfaction_score    0
card_type             0
point_earned          0
dtype: int64
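
The null counts above cover missing values only, while the key questions also ask about inconsistent values. The following is a rough sketch of a few extra checks one might run. It assumes the normalized column names `customerid`, `geography`, and `age`. The `toy` frame is made-up data with deliberate defects, and `quality_report` is a hypothetical helper, not something from the original notebook.

```python
import pandas as pd

# Hypothetical toy frame with deliberately inconsistent entries (not the real dataset).
toy = pd.DataFrame({
    "customerid": [1, 2, 2, 3],
    "geography": ["France", " france", " france", "Spain"],
    "age": [42, 41, 41, -5],
})


def quality_report(frame, id_col, categorical_cols):
    """Collect simple consistency checks that isnull() alone does not cover."""
    report = {
        # Fully identical rows
        "duplicate_rows": int(frame.duplicated().sum()),
        # Repeated identifiers, even if other fields differ
        "duplicate_ids": int(frame[id_col].duplicated().sum()),
    }
    for col in categorical_cols:
        raw = frame[col].nunique()
        # Labels that differ only by case or surrounding whitespace collapse here
        normalized = frame[col].str.strip().str.lower().nunique()
        report[f"{col}_label_variants"] = raw - normalized
    return report


report = quality_report(toy, "customerid", ["geography"])

# Range check: ages below zero are impossible and indicate bad input.
negative_ages = int((toy["age"] < 0).sum())

print(report)
print("negative ages:", negative_ages)
```

A non-zero `geography_label_variants` value suggests that some labels differ only in case or whitespace and may need normalizing before analysis.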

In [5]:
# isnull() reported no missing values above, so this is a no-op kept as a safeguard
df = df.dropna()


In [6]:
df["exited"].value_counts()


exited
0    7962
1    2038
Name: count, dtype: int64

In [7]:
df["exited"].value_counts(normalize=True) * 100


exited
0    79.62
1    20.38
Name: proportion, dtype: float64
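
The proportions above show a moderately imbalanced target: about one in five customers churned. As an illustrative sketch (the variable names are assumptions, and only the counts printed above are used), the following computes the churn rate, the ratio of retained to churned customers, and the accuracy a trivial "always predict no churn" rule would reach. That baseline is a useful yardstick for any later model.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 7962, 1: 2038}, name="count")

churn_rate = counts[1] / counts.sum()            # share of customers who exited
majority_baseline = counts.max() / counts.sum()  # accuracy of always predicting class 0
imbalance_ratio = counts[0] / counts[1]          # retained customers per churned customer

print(f"churn rate: {churn_rate:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
print(f"retained per churned customer: {imbalance_ratio:.2f}")
```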

In [8]:
df = df.rename(columns={
    "exited": "churn",
    "hascrcard": "has_credit_card",
    "isactivemember": "is_active_member"
})

df.columns


Index(['rownumber', 'customerid', 'surname', 'creditscore', 'geography',
       'gender', 'age', 'tenure', 'balance', 'numofproducts',
       'has_credit_card', 'is_active_member', 'estimatedsalary', 'churn',
       'complain', 'satisfaction_score', 'card_type', 'point_earned'],
      dtype='object')

In [None]:
# -------- Save Cleaned Dataset --------
import os

# Columns were already renamed in the previous cell
# (exited -> churn, hascrcard -> has_credit_card, isactivemember -> is_active_member),
# so no further renaming is needed here.

# Create the processed-data folder if it does not exist
os.makedirs("../data/processed", exist_ok=True)

# Save the cleaned dataset without writing the DataFrame index
df.to_csv("../data/processed/customer_churn_clean.csv", index=False)

print("✅ Cleaned dataset saved successfully!")
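
After saving, it can be worth confirming that the file reads back with the same columns and row count. The sketch below does this on a small made-up frame written to a temporary directory, so it does not touch the project folders. The `sample` data and variable names are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Small illustrative frame standing in for the cleaned dataset.
sample = pd.DataFrame({
    "customerid": [15634602, 15647311],
    "geography": ["France", "Spain"],
    "churn": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "customer_churn_clean.csv")

    # index=False keeps the row index out of the file; without it, reading the
    # file back would produce an extra "Unnamed: 0" column.
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(sample.columns)
same_row_count = len(reloaded) == len(sample)

print("columns match:", same_columns)
print("row count matches:", same_row_count)
```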
