# Customer Segmentation - FIXED VERSION - Part 3

## Clustering & Business Insights

This notebook continues from Part 2 with feature preparation, clustering, and interpretation.

<h2 style="color:darkmagenta;text-align: center; background-color: AliceBlue;padding: 20px;">7. Feature Preparation for Clustering</h2><a id="7"></a>

### 🔴 CRITICAL FIX: Proper Handling of Categorical Variables

**The Original Error:**
```python
# WRONG - Treating categorical as numeric!
df['CustGender'] = df['CustGender'].map({'M': 1, 'F': 0})
df_scaled = StandardScaler().fit_transform(df)  # Scales the 0/1 values!
```

**Problems:**
1. Gender is NOMINAL (no order), not ordinal
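2. An integer code such as M=1, F=0 is read by distance-based algorithms as a magnitude (M > F); for a two-level variable this is harmless, but it imposes a fake order as soon as a feature has three or more categories
3. StandardScaler turns the 0/1 values into arbitrary decimals that depend on the class balance, so the column no longer reads as an indicator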
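
Problem 3 can be checked directly. The sketch below uses a made-up gender indicator with a 75/25 split (an assumption for illustration, not the real data) and shows what `StandardScaler` does to it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy 0/1 gender indicator: 3 males (1) and 1 female (0), illustrative only
gender = np.array([[1.0], [1.0], [1.0], [0.0]])

scaled = StandardScaler().fit_transform(gender)

# The column is still two-valued, but no longer 0/1. The two values depend on
# the class proportions (mean = 0.75, population std = sqrt(0.75 * 0.25)),
# giving roughly 0.577 for males and -1.732 for females.
print(np.unique(scaled.round(3)))
```

With a different male/female ratio the same two categories would map to different numbers, which is why the indicator is better left unscaled.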

**The Correct Approach:**
```python
# CORRECT - handle each feature type separately
# categorical -> one-hot encoding -> keep as is (already 0/1)
# numerical   -> StandardScaler   -> standardize for equal weighting
```

---

### Why Feature Scaling Matters:

K-Means uses **Euclidean distance**, so feature scales matter!

**Example:**
- Customer A: Recency=5 days, MonetaryTotal=₹10,000
- Customer B: Recency=10 days, MonetaryTotal=₹15,000

**Without scaling:**
```
Distance = √[(5-10)² + (10000-15000)²] = √[25 + 25,000,000] ≈ 5,000
```
Monetary dominates! The 5-day recency difference contributes almost nothing to the distance.

**With scaling:**
```
Both features are rescaled to comparable ranges
Both contribute on a similar footing to the distance calculation
```
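
The two cases above can be worked through numerically. This is a minimal sketch: the standard deviations used for scaling are hypothetical values chosen for illustration, not statistics from the dataset.

```python
import numpy as np

# Customers from the example: [Recency (days), MonetaryTotal (rupees)]
a = np.array([5.0, 10_000.0])
b = np.array([10.0, 15_000.0])

# Raw distance: the monetary gap swamps the recency gap
raw_sq = (a - b) ** 2
raw_dist = np.sqrt(raw_sq.sum())
print(f"raw distance: {raw_dist:.2f}, monetary share: {raw_sq[1] / raw_sq.sum():.6f}")

# Hypothetical population standard deviations (assumed for illustration)
std = np.array([30.0, 20_000.0])

# Dividing by std puts both gaps on a unitless scale; the mean cancels
# out of the difference, so it is not needed here
scaled_sq = ((a - b) / std) ** 2
scaled_dist = np.sqrt(scaled_sq.sum())
print(f"scaled distance: {scaled_dist:.3f}, monetary share: {scaled_sq[1] / scaled_sq.sum():.3f}")
```

Before scaling, MonetaryTotal accounts for practically all of the squared distance; after scaling, Recency contributes a visible share as well.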

In [None]:
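# Imports used in this part
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load customer dataset from Part 2
customer_df = pd.read_csv('customer_rfm_features.csv')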

print("📊 Customer Dataset Loaded")
print("=" * 80)
print(f"Shape: {customer_df.shape}")
print(f"\nColumns: {list(customer_df.columns)}")

In [None]:
# Define features for clustering
print("🎯 Selecting Features for Clustering")
print("=" * 80)

# Numerical features - will be scaled
numerical_features = [
    'Recency',           # Days since last transaction
    'Frequency',         # Number of transactions
    'MonetaryTotal',     # Total spending
    'MonetaryAvg',       # Average transaction amount
    'AccountBalance',    # Current account balance
    'Age'                # Customer age
]

# Categorical features - will be one-hot encoded
categorical_features = [
    'Gender'             # M/F
]

# Features to exclude from clustering
exclude_features = [
    'CustomerID',         # Identifier (not a feature)
    'LastTransactionDate',# Already captured in Recency
    'Location',           # Too many categories (800+ cities)
    'MonetaryStd',        # Volatility - can add if desired
    'AccountBalanceAvg'   # Redundant with AccountBalance
]

print(f"\nNumerical features ({len(numerical_features)}):")
for feat in numerical_features:
    print(f"  ✓ {feat}")

print(f"\nCategorical features ({len(categorical_features)}):")
for feat in categorical_features:
    print(f"  ✓ {feat}")

print(f"\nExcluded features ({len(exclude_features)}):")
for feat in exclude_features:
    print(f"  ✗ {feat}")

### Step 1: Handle Categorical Variables - ONE-HOT ENCODING

**What is One-Hot Encoding?**

Converts categorical variables into binary columns:

```
Before:              After:
Gender               Gender_F    Gender_M
------               --------    --------
  M                     0           1
  F                     1           0
  M                     0           1
```

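**Why?**
- No artificial ordering between categories (M is not "greater than" F)
- Each category gets its own indicator column
- Works with distance-based algorithms such as K-Means

**Note:** We use `drop_first=True` to drop the redundant column
- Gender_F and Gender_M always sum to 1, so one of them carries no extra information (in regression this causes multicollinearity; in K-Means it would count gender twice in the distance)
- `drop_first=True` drops the first category in sorted order (F), leaving Gender_M, where 1=Male and 0=Female
- This assumes Gender contains only M and F; any other value would produce additional columns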
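
The effect of `drop_first=True` can be seen on a toy column. The sketch assumes a `Gender` column that contains only 'M' and 'F'; the three rows are made up.

```python
import pandas as pd

# Toy column, assuming Gender contains only 'M' and 'F'
toy = pd.DataFrame({'Gender': ['M', 'F', 'M']})

# Full one-hot: two columns that always sum to 1
full = pd.get_dummies(toy, columns=['Gender'], dtype=int)
print(full)

# drop_first=True drops the first category in sorted order ('F'),
# leaving a single Gender_M indicator
reduced = pd.get_dummies(toy, columns=['Gender'], drop_first=True, dtype=int)
print(reduced)
```

Passing `dtype=int` keeps the output as 0/1 integers; recent pandas versions return booleans by default.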

In [None]:
# One-hot encode categorical variables
print("🔧 One-Hot Encoding Categorical Variables")
print("=" * 80)

# Before encoding
print("\nBefore encoding:")
print(customer_df[categorical_features].head())
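print("\nGender distribution:")
print(customer_df['Gender'].value_counts())

# Apply one-hot encoding (column names are used as prefixes by default)
df_categorical = pd.get_dummies(
    customer_df[categorical_features],
    drop_first=True,  # Drop the redundant first category (F)
    dtype=int         # Keep 0/1 integers; pandas >= 2.0 returns booleans by default
)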

# After encoding
print("\nAfter encoding:")
print(df_categorical.head())
print(f"\nNew columns: {list(df_categorical.columns)}")
print("\nGender_M = 1 means Male")
print("Gender_M = 0 means Female")

print("\n✓ Categorical variables encoded")

### Step 2: Scale Numerical Variables - STANDARDIZATION

**What is StandardScaler?**

Transforms each feature to have:
- Mean = 0
- Standard Deviation = 1

**Formula:**
```
scaled_value = (original_value - mean) / std_deviation
```

**Example:**
```
Recency: [1, 5, 10, 30, 365] → [-0.57, -0.54, -0.51, -0.37, 2.00]
All features now on comparable scales!
```
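
The formula can be verified against scikit-learn on a small Recency array. This sketch standardizes the values by hand and checks that `StandardScaler` gives the same result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

recency = np.array([1, 5, 10, 30, 365], dtype=float)

# Manual standardization; StandardScaler uses the population std (ddof=0)
manual = (recency - recency.mean()) / recency.std(ddof=0)

# Same result from scikit-learn (expects a 2-D array, hence the reshape)
sklearn_scaled = StandardScaler().fit_transform(recency.reshape(-1, 1)).ravel()

print(manual.round(2))
print(np.allclose(manual, sklearn_scaled))
```

Note that `DataFrame.describe()` reports the sample standard deviation (ddof=1), so the std it shows after scaling is only approximately 1.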

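**Why StandardScaler (not MinMaxScaler)?**
- Less sensitive to outliers than MinMaxScaler, whose output range is set entirely by the minimum and maximum (the mean and std are still pulled by extreme values)
- Preserves information about extremes
- Common choice for K-Means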

In [None]:
# Scale numerical features
print("📏 Scaling Numerical Features (StandardScaler)")
print("=" * 80)

# Before scaling
print("\nBefore scaling (sample):")
print(customer_df[numerical_features].head())
print("\nStatistics before scaling:")
print(customer_df[numerical_features].describe().T[['mean', 'std']])

# Apply StandardScaler
scaler = StandardScaler()
df_numerical_scaled = pd.DataFrame(
    scaler.fit_transform(customer_df[numerical_features]),
    columns=numerical_features,
    index=customer_df.index
)

# After scaling
print("\n" + "=" * 80)
print("After scaling (sample):")
print(df_numerical_scaled.head())
print("\nStatistics after scaling:")
desc_stats = df_numerical_scaled.describe().T[['mean', 'std']].round(6)
print(desc_stats)
print("\n✓ All features now have mean ≈ 0 and std ≈ 1")

print("\n💾 Saving scaler for future use...")
import os
os.makedirs('models', exist_ok=True)  # joblib.dump fails if the folder is missing
joblib.dump(scaler, 'models/standard_scaler.pkl')
print("   Scaler saved to 'models/standard_scaler.pkl'")
print("   Use this to scale new customers in production!")
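
Reusing the saved scaler on new customers might look like the sketch below. It is self-contained: the training frame, the shortened feature list, and the temporary file path are all stand-ins, not the notebook's real data or paths.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = ['Recency', 'Frequency']  # shortened feature list for the sketch

# Stand-in for the training data (made-up values)
train = pd.DataFrame({'Recency': [1.0, 5.0, 10.0, 30.0],
                      'Frequency': [12.0, 6.0, 3.0, 1.0]})
scaler = StandardScaler().fit(train[features])

# Round-trip through a temporary file instead of 'models/'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'standard_scaler.pkl')
    joblib.dump(scaler, path)
    loaded = joblib.load(path)

# New customers are scaled with transform(), not fit_transform(),
# so that the training means and stds are reused
new_customer = pd.DataFrame({'Recency': [7.0], 'Frequency': [4.0]})
new_scaled = loaded.transform(new_customer[features])
print(new_scaled)
```

Calling `fit_transform` on new data would recompute the statistics and place new customers on a different scale from the one the clusters were built on.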

### Step 3: Combine Features - FINAL DATASET FOR CLUSTERING

**Final dataset combines:**
- Scaled numerical features (mean=0, std=1)
- One-hot encoded categorical features (0 or 1)

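**Note:** Categorical features are NOT scaled because:
- They are already binary (0 or 1)
- Scaling would distort their meaning
- They represent presence/absence, not magnitude

An unscaled 0/1 column has a standard deviation of at most 0.5, so it weighs somewhat less in the distance than a standardized numerical feature.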
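
One detail worth checking when combining the two blocks: `pd.concat(axis=1)` aligns rows by index label, which is why the scaled frame is built with `index=customer_df.index`. The sketch below uses made-up toy frames to show the aligned case and what happens when the indexes differ.

```python
import pandas as pd

# Toy stand-ins for the two prepared blocks (made-up values)
num_scaled = pd.DataFrame({'Recency': [-0.5, 1.2, -0.7]}, index=[0, 1, 2])
cat_encoded = pd.DataFrame({'Gender_M': [1, 0, 1]}, index=[0, 1, 2])

# Matching indexes: rows line up one-to-one
combined = pd.concat([num_scaled, cat_encoded], axis=1)
print(combined)

# Mismatched indexes: concat takes the union of labels and fills with NaN
shifted = cat_encoded.set_axis([10, 11, 12], axis=0)
misaligned = pd.concat([num_scaled, shifted], axis=1)
print(misaligned.shape, misaligned.isna().any().any())
```

A quick check that the combined frame has the expected row count and no NaN values catches this kind of misalignment before clustering.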

In [None]:
# Combine scaled numerical and one-hot encoded categorical
print("🔗 Combining Features")
print("=" * 80)

df_clustering = pd.concat([
    df_numerical_scaled,  # Scaled numerical
    df_categorical        # One-hot encoded categorical
], axis=1)

print(f"\nFinal dataset for clustering:")
print(f"  Shape: {df_clustering.shape}")
print(f"  Rows (customers): {df_clustering.shape[0]:,}")
print(f"  Columns (features): {df_clustering.shape[1]}")
print(f"\nFeatures:")
for col in df_clustering.columns:
    print(f"  ✓ {col}")

print("\nSample data:")
df_clustering.head()

### Step 4: Sampling Strategy (Optional)

**The Dilemma:**
- Full dataset: 800K+ customers (computationally expensive)
- Sample: Faster, but may miss patterns

**Options:**

1. **Random Sampling** (Current approach)
   - Pros: Fast, simple
   - Cons: May miss rare segments

2. **Stratified Sampling**
   - Pros: Preserves proportions
   - Cons: Needs a stratification variable

3. **Mini-Batch K-Means** (Best for large data)
   - Pros: Handles full dataset efficiently
   - Cons: Cluster quality is typically slightly lower than full K-Means

**Our Choice:** Random sampling of 100K customers for speed
- Represents ~12.5% of data
- Large enough to capture the major segments, though rare ones may be under-represented
- Faster iteration for demonstration
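
For reference, option 3 could be sketched as follows. This uses scikit-learn's `MiniBatchKMeans` on synthetic blobs rather than the customer data; the cluster count and batch size are illustrative values, not tuned choices.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the prepared feature matrix (not the real data)
X, _ = make_blobs(n_samples=20_000, centers=4, n_features=7, random_state=42)

# n_clusters and batch_size are illustrative values, not tuned choices
mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, n_init=3, random_state=42)
labels = mbk.fit_predict(X)

print(f"clusters found: {len(set(labels))}, inertia: {mbk.inertia_:.1f}")
```

Each iteration updates the centroids from a small random batch, so the full dataset never has to be processed in a single pass.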

In [None]:
# Sampling (if configured); CONFIG is assumed to be defined as in the earlier parts
if CONFIG.get('USE_SAMPLING', True) and CONFIG.get('SAMPLE_SIZE'):
    print("📊 Sampling Strategy")
    print("=" * 80)
    print(f"Full dataset size: {len(df_clustering):,} customers")
    
    sample_size = min(CONFIG['SAMPLE_SIZE'], len(df_clustering))
    df_final = df_clustering.sample(
        n=sample_size,
        random_state=CONFIG['RANDOM_STATE']
    ).reset_index(drop=True)
    
    print(f"Sample size: {len(df_final):,} customers")
    print(f"Sampling rate: {len(df_final)/len(df_clustering)*100:.1f}%")
    print(f"\n⚠️  Note: Using sample for faster computation")
    print(f"   For production, consider Mini-Batch K-Means on full dataset")
else:
    print("Using full dataset (no sampling)")
    df_final = df_clustering.copy()

print(f"\n✓ Final dataset ready: {len(df_final):,} customers × {df_final.shape[1]} features")
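
If rare groups matter, a stratified variant of the sampling step is one alternative. The sketch below is an assumption-laden illustration: it stratifies on `Gender_M` in a made-up frame with a 70/30 split and uses `DataFrameGroupBy.sample` (pandas 1.1 or later).

```python
import pandas as pd

# Toy frame with an imbalanced stratification column (made-up data)
df = pd.DataFrame({
    'Recency': range(1000),
    'Gender_M': [1] * 700 + [0] * 300,
})

# Sample 10% within each Gender_M group so the 70/30 split is preserved
stratified = (
    df.groupby('Gender_M', group_keys=False)
      .sample(frac=0.10, random_state=42)
)

print(stratified['Gender_M'].value_counts())
```

Plain random sampling only preserves the split approximately; sampling within each group preserves it exactly, up to rounding.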

In [None]:
# Save categorical columns list for later reference
categorical_cols = list(df_categorical.columns)
joblib.dump(categorical_cols, 'models/categorical_columns.pkl')
print("💾 Categorical column names saved")

### ✅ Feature Preparation Complete!

**What we accomplished:**
1. ✓ Separated numerical and categorical features
2. ✓ One-hot encoded categorical variables (NO fake ordering!)
3. ✓ Scaled numerical variables using StandardScaler
4. ✓ Combined features properly
5. ✓ Applied sampling strategy
6. ✓ Saved preprocessing objects for production use

**Critical Fixes Applied:**
- ❌ Original: `Gender → {M:1, F:0} → StandardScaler()` (WRONG!)
- ✅ Fixed: `Gender → One-Hot → Keep binary` (CORRECT!)

**Ready for clustering!**