# 🏦 Bank Customer Churn Prediction
## Notebook 1: Data Upload & First Look

**Project Goal:** Build a machine learning pipeline that identifies bank customers likely to *churn* (close their account or stop using services). Early detection allows the bank to act proactively, offering retention incentives before a customer leaves.

**Dataset:** `Customer-Churn-Records.csv`, containing 10,000 bank customers with 18 attributes including demographics, account information, and a binary target `Exited` (1 = churned, 0 = stayed).

---
### 📋 Notebook Roadmap
| Notebook | Content |
|---|---|
| **N1 ← You are here** | Data upload, shape, types, first look |
| N2 | Exploratory Data Analysis (EDA) |
| N3 | Data Cleaning |
| N4 | Feature Engineering & Preprocessing |
| N5 | Model Training & Selection |
| N6 | Model Saving |
| N7 | Inference Module (deploy-ready) |

In [1]:
# ── Standard imports ──────────────────────────────────────────
import pandas as pd
import numpy as np

print('pandas  :', pd.__version__)
print('numpy   :', np.__version__)

pandas  : 2.3.3
numpy   : 2.0.1


## 1. Load the Dataset

In [None]:
# pd.read_csv() reads a comma-separated values file into a DataFrame.
# A DataFrame is a 2-D labelled table: think of it as a Python-native spreadsheet.

# If the CSV is not in the notebook's working directory, update the path below.
data = pd.read_csv('Customer-Churn-Records.csv')


print(f'Dataset loaded. Shape: {data.shape}')   # (rows, columns)
data.head()                                       # preview the first 5 rows

Dataset loaded. Shape: (10000, 18)


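| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Complain | Satisfaction Score | Card Type | Point Earned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 2 | DIAMOND | 464 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 3 | DIAMOND | 456 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 3 | DIAMOND | 377 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 5 | GOLD | 350 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 5 | GOLD | 425 |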
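
Since the load step depends on the CSV path being correct, a small wrapper can fail early with a clearer message than the default traceback. This is only a sketch: the helper name `load_churn_data` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_churn_data(csv_path):
    """Read a CSV into a DataFrame, failing early if the path is wrong.

    Hypothetical helper for illustration; the notebook itself calls
    pd.read_csv() directly.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Show the absolute path so it is obvious where pandas would look.
        raise FileNotFoundError(
            f"CSV not found at {path.resolve()}; update the path and retry."
        )
    return pd.read_csv(path)
```

In the notebook this would be called as `data = load_churn_data('Customer-Churn-Records.csv')`.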


## 2. Dataset Shape

`data.shape` returns a tuple `(n_rows, n_columns)`.  
Our dataset has **10,000 customer records** and **18 columns** (including the target `Exited`).

In [5]:
rows, cols = data.shape
print(f'Rows    : {rows:,}')   # number of customer records
print(f'Columns : {cols}')     # number of features (including the target)

Rows    : 10,000
Columns : 18


## 3. Column Data Types & Non-Null Counts

`df.info()` is one of the most useful exploratory tools:
- **Dtype** tells us whether a column is numeric (`int64`, `float64`) or text (`object`).
- **Non-Null Count** tells us immediately if there are missing values (a count < total rows flags a gap).

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   RowNumber           10000 non-null  int64  
 1   CustomerId          10000 non-null  int64  
 2   Surname             10000 non-null  object 
 3   CreditScore         10000 non-null  int64  
 4   Geography           10000 non-null  object 
 5   Gender              10000 non-null  object 
 6   Age                 10000 non-null  int64  
 7   Tenure              10000 non-null  int64  
 8   Balance             10000 non-null  float64
 9   NumOfProducts       10000 non-null  int64  
 10  HasCrCard           10000 non-null  int64  
 11  IsActiveMember      10000 non-null  int64  
 12  EstimatedSalary     10000 non-null  float64
 13  Exited              10000 non-null  int64  
 14  Complain            10000 non-null  int64  
 15  Satisfaction Score  10000 non-null  int64  
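 16  Card Type           10000 non-null  object 
 17  Point Earned        10000 non-null  int64  
dtypes: float64(2), int64(12), object(4)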

**Observations:**
- All 10,000 rows are fully populated: **no missing values**.
- 4 text columns: `Surname`, `Geography`, `Gender`, `Card Type`. `Surname` will be dropped; the other three will need encoding before modelling.
- 2 float columns: `Balance`, `EstimatedSalary`.
- 12 integer columns covering identifiers, counts, and binary flags.
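
The same observations can also be checked programmatically rather than read off the `info()` printout. The sketch below uses a hypothetical `sample` frame built from five preview rows (and only a few columns) shown earlier, so it runs without the CSV; on the real data you would apply the same calls to `data`.

```python
import pandas as pd

# Five preview rows copied from the head() output, restricted to a few columns.
sample = pd.DataFrame({
    "Surname": ["Hargrave", "Hill", "Onio", "Boni", "Mitchell"],
    "Geography": ["France", "Spain", "France", "France", "Spain"],
    "CreditScore": [619, 608, 502, 699, 850],
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82],
    "Exited": [1, 0, 1, 0, 0],
})

# Missing values per column; all zeros means the column is fully populated.
missing_per_column = sample.isna().sum()

# Group column names by broad kind.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
float_columns = sample.select_dtypes(include="float64").columns.tolist()
integer_columns = sample.select_dtypes(include="int64").columns.tolist()

print(missing_per_column)
print("text   :", text_columns)
print("float  :", float_columns)
print("integer:", integer_columns)
```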

## 4. Feature Dictionary

Before any analysis, understanding *what each column means* is critical. Below is the data dictionary:

| Column | Type | Description |
|---|---|---|
| `RowNumber` | int | Row index (no predictive information, will be dropped) |
| `CustomerId` | int | Unique customer identifier (no predictive information, will be dropped) |
| `Surname` | str | Customer surname (no predictive information, will be dropped) |
| `CreditScore` | int | Credit rating score (300–850 scale; observed range in this dataset is 350–850) |
| `Geography` | str | Country: France, Germany, Spain |
| `Gender` | str | Male / Female |
| `Age` | int | Customer age in years |
| `Tenure` | int | Years as a bank customer |
| `Balance` | float | Account balance (€) |
| `NumOfProducts` | int | Number of bank products held (1–4) |
| `HasCrCard` | int | Credit card holder? 1=Yes, 0=No |
| `IsActiveMember` | int | Active member? 1=Yes, 0=No |
| `EstimatedSalary` | float | Estimated annual salary (€) |
| `Exited` | int | **TARGET**: Left the bank? 1=Yes (churned), 0=No |
| `Complain` | int | Filed a complaint? 1=Yes, 0=No |
| `Satisfaction Score` | int | Customer satisfaction (1–5) |
| `Card Type` | str | DIAMOND / GOLD / PLATINUM / SILVER |
| `Point Earned` | int | Loyalty points accumulated |
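
The dictionary can double as a lightweight set of checks. The sketch below assumes a hypothetical `sample` frame made from two preview rows and a subset of columns; the constants `ID_COLUMNS` and `BINARY_COLUMNS` are illustrative names, not part of the original notebook. It drops the identifier columns and confirms that the flag columns contain only 0 and 1.

```python
import pandas as pd

# Two preview rows from the head() output, restricted to a few columns.
sample = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [15634602, 15647311],
    "Surname": ["Hargrave", "Hill"],
    "CreditScore": [619, 608],
    "HasCrCard": [1, 0],
    "IsActiveMember": [1, 1],
    "Exited": [1, 0],
})

# Columns the dictionary marks as "will be dropped".
ID_COLUMNS = ["RowNumber", "CustomerId", "Surname"]
# A subset of the columns the dictionary describes as 1=Yes / 0=No flags.
BINARY_COLUMNS = ["HasCrCard", "IsActiveMember", "Exited"]

# drop() returns a new frame; the original sample is left untouched.
features = sample.drop(columns=ID_COLUMNS)

# One boolean per flag column: True when every value is 0 or 1.
flags_are_binary = sample[BINARY_COLUMNS].isin([0, 1]).all()

print(features.columns.tolist())
print(flags_are_binary)
```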

## 5. Descriptive Statistics

`df.describe()` computes summary statistics for all **numerical** columns:  
count, mean, std, min, quartiles (25%, 50%, 75%), and max.

`.T` transposes the result (columns become rows) which is easier to read when there are many features.
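
To make the shape of the transposed summary concrete, here is a minimal sketch on a hypothetical `sample` frame built from five preview rows. Note that the text column `Geography` does not appear in the result, because `describe()` summarises only numeric columns by default.

```python
import pandas as pd

# Five preview rows from the head() output, restricted to three columns.
sample = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "France", "Spain"],
    "Age": [42, 41, 42, 39, 43],
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82],
})

# After .T, each row is a numeric column and each column is a statistic.
summary = sample.describe().T.round(2)

print(summary.index.tolist())    # the numeric columns that were summarised
print(summary.columns.tolist())  # count, mean, std, min, quartiles, max
```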

In [7]:
data.describe().T.round(2)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| RowNumber | 10000.0 | 5000.5 | 2886.9 | 1.0 | 2500.75 | 5000.5 | 7500.25 | 10000.0 |
| CustomerId | 10000.0 | 15690940.57 | 71936.19 | 15565701.0 | 15628528.25 | 15690738.0 | 15753233.75 | 15815690.0 |
| CreditScore | 10000.0 | 650.53 | 96.65 | 350.0 | 584.0 | 652.0 | 718.0 | 850.0 |
| Age | 10000.0 | 38.92 | 10.49 | 18.0 | 32.0 | 37.0 | 44.0 | 92.0 |
| Tenure | 10000.0 | 5.01 | 2.89 | 0.0 | 3.0 | 5.0 | 7.0 | 10.0 |
| Balance | 10000.0 | 76485.89 | 62397.41 | 0.0 | 0.0 | 97198.54 | 127644.24 | 250898.09 |
| NumOfProducts | 10000.0 | 1.53 | 0.58 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| HasCrCard | 10000.0 | 0.71 | 0.46 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 10000.0 | 0.52 | 0.5 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 10000.0 | 100090.24 | 57510.49 | 11.58 | 51002.11 | 100193.92 | 149388.25 | 199992.48 |


**Key takeaways from descriptive statistics:**
- `Exited` mean ≈ 0.20 → roughly 20% of customers churned (class imbalance to address in N4).
- `Balance` has a high std (≈ 62,400) relative to its mean (≈ 76,500), and its 25th percentile is €0, so at least a quarter of customers hold no balance (will explore in N2).
- `Age` ranges from 18 to 92, with a median of 37 and a mean of 38.9: slightly right-skewed.
- `Complain` mean ≈ 0.20, almost identical to the `Exited` mean. Matching means alone do not prove a relationship, but they hint that the two columns may be strongly correlated (will investigate in N3).
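
Two columns can share the same mean without moving together, so the `Complain`/`Exited` relationship is better checked row by row. The sketch below shows one way to do that on a hypothetical `sample` frame holding the five preview rows; the numbers it produces describe only those five rows, not the full dataset.

```python
import pandas as pd

# Exited and Complain values from the five preview rows shown earlier.
sample = pd.DataFrame({
    "Exited":   [1, 0, 1, 0, 0],
    "Complain": [1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of rows equal to 1.
churn_rate = sample["Exited"].mean()

# Share of rows where the two flags take the same value.
agreement = (sample["Exited"] == sample["Complain"]).mean()

# Counts for each (Complain, Exited) combination.
contingency = pd.crosstab(sample["Complain"], sample["Exited"])

# Pearson correlation between the two flags.
correlation = sample["Exited"].corr(sample["Complain"])

print(f"churn rate : {churn_rate:.2f}")
print(f"agreement  : {agreement:.2f}")
print(f"correlation: {correlation:.2f}")
print(contingency)
```

On the full data, an agreement rate or correlation close to 1 would support the suspicion raised above; a low value would mean the similar means are a coincidence.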

---
### ✅ Notebook 1 Summary
- Dataset loaded: **10,000 rows × 18 columns**.
- **No missing values** detected.
- **3 categorical columns** (`Geography`, `Gender`, `Card Type`) require encoding; **3 identifier columns** (`RowNumber`, `CustomerId`, `Surname`) will be dropped.
- Target variable `Exited` shows a ~20/80 imbalance → needs handling before modelling (for example resampling or class weights).

➡️ Continue to **N2_ExploratoryDataAnalysis** for visual exploration.