# 🛠️ Adventurer Mart: ML Data Preparation - Part 5

## 🧹 5. Categorical Variables Cleaning

This notebook systematically cleans and encodes categorical variables for ML compatibility.

### 🎯 Objectives
- Normalize text and categorical values
- Consolidate similar categories
- Apply appropriate encoding techniques
- Create ML-ready categorical features
- Export encoded data for next phase

### 🔧 Processing Strategy
- **Text Normalization**: Standardize case and whitespace
- **Category Consolidation**: Merge similar values and handle rare categories
- **Smart Encoding**: Choose an encoding for each column based on its cardinality
- **Feature Engineering**: Create meaningful categorical representations
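
The text normalization step can be sketched on a toy column. This is a minimal illustration, assuming pandas string methods on a text Series; the sample values and the helper name `normalize_text` are hypothetical and not taken from the dataset.

```python
import pandas as pd

def normalize_text(series):
    """Strip, lowercase, and collapse internal whitespace; missing values stay missing."""
    cleaned = series.str.strip().str.lower()
    return cleaned.str.replace(r"\s+", " ", regex=True)

raw = pd.Series(["  Half-Elf ", "half   elf", None])  # hypothetical sample values
normalized = normalize_text(raw)
print(normalized.tolist())
```

The two spellings still differ after normalization (hyphen versus space), which is why a separate consolidation mapping is applied afterwards.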

In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import warnings
import re
from collections import Counter
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

warnings.filterwarnings('ignore')

print("📦 Libraries imported successfully!")
print("🧹 Categorical cleaning tools ready")

📦 Libraries imported successfully!
🧹 Categorical cleaning tools ready


In [2]:
# Load data from previous phase
print("📂 LOADING DATA FROM PHASE 4")
print("=" * 50)

try:
    # Load deduplicated dataframes
    with open('data_intermediate/04_deduplicated_dataframes.pkl', 'rb') as f:
        dataframes = pickle.load(f)
    print("✅ Loaded deduplicated dataframes from Phase 4")
    
    print(f"\n📊 Dataset Status:")
    print(f"   • Tables: {len(dataframes)}")
    total_records = sum(len(df) for df in dataframes.values())
    print(f"   • Total records: {total_records:,}")
    
    # Count categorical columns
    total_categorical = sum(len(df.select_dtypes(include=['object']).columns) for df in dataframes.values())
    print(f"   • Categorical columns: {total_categorical}")
    
except FileNotFoundError as e:
    print(f"❌ Error loading data: {e}")
    print("🔄 Please run previous phases first")
    raise

📂 LOADING DATA FROM PHASE 4
✅ Loaded deduplicated dataframes from Phase 4

📊 Dataset Status:
   • Tables: 9
   • Total records: 56,335
   • Categorical columns: 53


In [3]:
# Analyze categorical variables in detail
print("🔍 CATEGORICAL VARIABLES ANALYSIS")
print("=" * 60)

def analyze_categorical_variables(df, table_name):
    """Analyze categorical variables for cleaning strategy"""
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    
    if not categorical_cols:
        print(f"\n📋 {table_name}: No categorical columns found")
        return {}
    
    print(f"\n📋 Table: {table_name}")
    print("-" * 40)
    print(f"📝 Categorical columns: {len(categorical_cols)}")
    
    analysis = {}
    
    for col in categorical_cols:
        print(f"\n🔍 Analyzing: {col}")
        
        unique_count = df[col].nunique()
        total_count = len(df[col].dropna())
        value_counts = df[col].value_counts()
        
        print(f"   📊 Unique values: {unique_count:,}")
        print(f"   📊 Non-null count: {total_count:,}")
        
        # Check for text normalization issues across all distinct values
        # (inspecting only the first 10 would miss variants further down the column)
        sample_values = df[col].dropna().astype(str).unique()
        
        # Case inconsistencies
        case_variants = {}
        for value in sample_values:
            normalized = value.lower().strip()
            if normalized not in case_variants:
                case_variants[normalized] = []
            case_variants[normalized].append(value)
        
        case_issues = {k: v for k, v in case_variants.items() if len(v) > 1}
        if case_issues:
            print(f"   ⚠️ Case inconsistencies found: {len(case_issues)} groups")
        
        # Cardinality analysis
        cardinality_ratio = unique_count / total_count if total_count > 0 else 0
        
        if cardinality_ratio > 0.8:
            category_type = "High Cardinality"
            encoding_strategy = "Label + Frequency"
        elif unique_count <= 2:
            category_type = "Binary"
            encoding_strategy = "Label Encoding"
        elif unique_count <= 10:
            category_type = "Low Cardinality"
            encoding_strategy = "One-Hot Encoding"
        else:
            category_type = "Medium Cardinality"
            encoding_strategy = "Frequency Encoding"
        
        print(f"   📈 Category type: {category_type}")
        print(f"   🎯 Recommended encoding: {encoding_strategy}")
        
        # Top values
        print(f"   🔝 Top 3 values:")
        for i, (value, count) in enumerate(value_counts.head(3).items()):
            pct = (count / total_count) * 100
            print(f"      {i+1}. '{value}': {count:,} ({pct:.1f}%)")
        
        analysis[col] = {
            'unique_count': unique_count,
            'total_count': total_count,
            'cardinality_ratio': cardinality_ratio,
            'category_type': category_type,
            'encoding_strategy': encoding_strategy,
            'case_issues': case_issues,
            'top_values': value_counts.head(5).to_dict()
        }
    
    return analysis

# Analyze all tables
categorical_analysis = {}
for table_name, df in dataframes.items():
    analysis = analyze_categorical_variables(df, table_name)
    if analysis:
        categorical_analysis[table_name] = analysis

print("\n" + "="*60)
print("✅ Categorical analysis completed!")

🔍 CATEGORICAL VARIABLES ANALYSIS

📋 Table: details_adventure_gear
----------------------------------------
📝 Categorical columns: 6

🔍 Analyzing: item_id
   📊 Unique values: 106
   📊 Non-null count: 106
   📈 Category type: High Cardinality
   🎯 Recommended encoding: Label + Frequency
   🔝 Top 3 values:
      1. '01-Ars': 1 (0.9%)
      2. '83-Rrs': 1 (0.9%)
      3. '81-Rrs': 1 (0.9%)

🔍 Analyzing: name
   📊 Unique values: 105
   📊 Non-null count: 106
   📈 Category type: High Cardinality
   🎯 Recommended encoding: Label + Frequency
   🔝 Top 3 values:
      1. 'Flask or Tankard': 2 (1.9%)
      2. 'Abacus': 1 (0.9%)
      3. 'Parchment (one sheet)': 1 (0.9%)

🔍 Analyzing: price
   📊 Unique values: 25
   📊 Non-null count: 106
   📈 Category type: Medium Cardinality
   🎯 Recommended encoding: Frequency Encoding
   🔝 Top 3 values:
      1. '1 gp': 19 (17.9%)
      2. '2 gp': 15 (14.2%)
      3. '5 gp': 14 (13.2%)

🔍 Analyzing: weight
   📊 Unique values: 19
   📊 Non-null count: 106
   📈 Cate
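
The cardinality rule used in the analysis cell can be isolated into a small helper, which makes the thresholds easier to test. The function name `choose_encoding` is hypothetical; the thresholds are meant to mirror those in `analyze_categorical_variables`.

```python
def choose_encoding(unique_count, non_null_count):
    """Return the encoding strategy label for a column, mirroring the notebook's thresholds."""
    ratio = unique_count / non_null_count if non_null_count > 0 else 0
    if ratio > 0.8:
        return "Label + Frequency"   # high cardinality
    if unique_count <= 2:
        return "Label Encoding"      # binary
    if unique_count <= 10:
        return "One-Hot Encoding"    # low cardinality
    return "Frequency Encoding"      # medium cardinality

# Counts taken from the details_adventure_gear output above
for name, unique, non_null in [("item_id", 106, 106), ("price", 25, 106)]:
    print(name, "->", choose_encoding(unique, non_null))
```

Because the ratio test runs first, a very small table can push a column with only a few distinct values into the high cardinality branch.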

In [4]:
# Clean and normalize categorical text
print("🧹 CATEGORICAL TEXT NORMALIZATION")
print("=" * 60)

def normalize_categorical_text(df, table_name, analysis):
    """Normalize and clean categorical text"""
    print(f"\n🔧 Processing: {table_name}")
    print("-" * 40)
    
    if not analysis:
        print("✅ No categorical columns to normalize")
        return df.copy(), []
    
    df_normalized = df.copy()
    normalization_log = []
    
    for col, col_analysis in analysis.items():
        print(f"\n🔍 Normalizing: {col}")
        
        original_unique = col_analysis['unique_count']
        
        # Basic text normalization, applied to non-null values only so that
        # missing values stay missing instead of becoming the string 'nan' or 'none'
        non_null = df_normalized[col].notna()
        cleaned = df_normalized.loc[non_null, col].astype(str).str.strip().str.lower()
        df_normalized.loc[non_null, col] = cleaned.str.replace(r'\s+', ' ', regex=True)
        
        print(f"   ✅ Applied basic text normalization")
        
        # Domain-specific consolidations
        if 'race' in col.lower():
            consolidations = {
                'half-elf': 'half_elf', 'half elf': 'half_elf',
                'half-orc': 'half_orc', 'half orc': 'half_orc',
                'dragon born': 'dragonborn'
            }
            for old, new in consolidations.items():
                mask = df_normalized[col] == old
                if mask.any():
                    df_normalized.loc[mask, col] = new
                    print(f"   🔄 Consolidated '{old}' → '{new}'")
        
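        # Handle rare categories (combine if >10 unique and <1% frequency).
        # Caveat: in identifier-like columns every value is unique, so all values fall
        # below the threshold and the whole column collapses to 'other'.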
        if original_unique > 10:
            value_counts = df_normalized[col].value_counts()
            total = len(df_normalized[col].dropna())
            rare_threshold = max(1, total * 0.01)
            
            rare_categories = value_counts[value_counts < rare_threshold]
            if len(rare_categories) > 0:
                rare_list = rare_categories.index.tolist()
                df_normalized.loc[df_normalized[col].isin(rare_list), col] = 'other'
                print(f"   📉 Consolidated {len(rare_categories)} rare categories → 'other'")
                normalization_log.append(f"{col}: Consolidated {len(rare_categories)} rare categories")
        
        final_unique = df_normalized[col].nunique()
        reduction = original_unique - final_unique
        print(f"   📊 Categories: {original_unique} → {final_unique} (-{reduction})")
        
        normalization_log.append(f"{col}: Basic normalization applied")
    
    return df_normalized, normalization_log

# Apply normalization to all tables
normalized_dataframes = {}
all_normalization_logs = {}

for table_name, df in dataframes.items():
    analysis = categorical_analysis.get(table_name, {})
    normalized_df, log = normalize_categorical_text(df, table_name, analysis)
    normalized_dataframes[table_name] = normalized_df
    all_normalization_logs[table_name] = log

print("\n" + "="*60)
print("✅ Text normalization completed!")

🧹 CATEGORICAL TEXT NORMALIZATION

🔧 Processing: details_adventure_gear
----------------------------------------

🔍 Normalizing: item_id
   ✅ Applied basic text normalization
   📉 Consolidated 106 rare categories → 'other'
   📊 Categories: 106 → 1 (-105)

🔍 Normalizing: name
   ✅ Applied basic text normalization
   📉 Consolidated 104 rare categories → 'other'
   📊 Categories: 105 → 2 (-103)

🔍 Normalizing: price
   ✅ Applied basic text normalization
   📉 Consolidated 11 rare categories → 'other'
   📊 Categories: 25 → 14 (-11)

🔍 Normalizing: weight
   ✅ Applied basic text normalization
   📉 Consolidated 7 rare categories → 'other'
   📊 Categories: 19 → 13 (-6)

🔍 Normalizing: category
   ✅ Applied basic text normalization
   📊 Categories: 6 → 6 (-0)

🔍 Normalizing: type
   ✅ Applied basic text normalization
   📊 Categories: 1 → 1 (-0)

🔧 Processing: details_magic_items
----------------------------------------

🔍 Normalizing: item_id
   ✅ Applied basic text normalization
   📉 Consolidate
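
The output above shows `item_id` collapsing from 106 categories to 1, because every identifier is individually rare. One possible guard, sketched here under assumptions, skips consolidation for columns that are mostly unique. The helper name `collapse_rare` and the `max_unique_ratio` cutoff of 0.8 are hypothetical choices, and the rarity test uses a share below `min_share` rather than the notebook's count threshold.

```python
import pandas as pd

def collapse_rare(series, min_share=0.01, max_unique_ratio=0.8, other_label="other"):
    """Replace rare categories with `other_label`, leaving identifier-like columns untouched."""
    non_null = series.dropna()
    if non_null.empty:
        return series.copy()
    # Identifier-like column: nearly every value is distinct, so "rare" is meaningless
    if non_null.nunique() / len(non_null) > max_unique_ratio:
        return series.copy()
    shares = non_null.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare (missing values are kept as missing)
    return series.where(~series.isin(rare), other_label)

ids = pd.Series([f"id-{i}" for i in range(106)])               # hypothetical identifiers
prices = pd.Series(["1 gp"] * 150 + ["2 gp"] * 49 + ["99 gp"])  # hypothetical price labels
print(collapse_rare(ids).nunique(), sorted(collapse_rare(prices).unique()))
```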

In [5]:
# Apply categorical encoding strategies
print("🎯 CATEGORICAL VARIABLE ENCODING")
print("=" * 60)

def encode_categorical_variables(df, table_name, analysis):
    """Apply optimal encoding strategies to categorical variables"""
    print(f"\n🔧 Encoding variables for: {table_name}")
    print("-" * 40)
    
    if not analysis:
        print("✅ No categorical variables to encode")
        return df.copy(), {}
    
    df_encoded = df.copy()
    encoding_info = {}
    
    for col, col_analysis in analysis.items():
        if col not in df_encoded.columns:
            continue
            
        print(f"\n🔍 Encoding: {col}")
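        # The strategy was chosen before normalization, so it may no longer
        # match the column's current cardinality after rare-category consolidation
        strategy = col_analysis['encoding_strategy']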
        unique_count = df_encoded[col].nunique()
        
        print(f"   📊 Strategy: {strategy}")
        print(f"   📊 Unique values: {unique_count}")
        
        if strategy == "Label Encoding":
            # Binary variables (LabelEncoder assigns integer codes in sorted order,
            # which is not a meaningful ordinal ranking)
            le = LabelEncoder()
            non_null_mask = df_encoded[col].notna()
            df_encoded.loc[non_null_mask, f"{col}_label"] = le.fit_transform(df_encoded.loc[non_null_mask, col])
            
            encoding_info[col] = {
                'strategy': 'label_encoding',
                'encoder': le,
                'new_columns': [f"{col}_label"]
            }
            print(f"   ✅ Created: {col}_label")
            
        elif strategy == "One-Hot Encoding":
            # Low cardinality variables
            dummies = pd.get_dummies(df_encoded[col], prefix=col, dummy_na=False)
            for dummy_col in dummies.columns:
                df_encoded[dummy_col] = dummies[dummy_col]
            
            encoding_info[col] = {
                'strategy': 'one_hot_encoding',
                'new_columns': dummies.columns.tolist()
            }
            print(f"   ✅ Created: {len(dummies.columns)} dummy variables")
            
        elif strategy == "Frequency Encoding":
            # Medium cardinality variables
            freq_map = df_encoded[col].value_counts().to_dict()
            df_encoded[f"{col}_freq"] = df_encoded[col].map(freq_map)
            
            # Also create label encoding
            le = LabelEncoder()
            non_null_mask = df_encoded[col].notna()
            df_encoded.loc[non_null_mask, f"{col}_label"] = le.fit_transform(df_encoded.loc[non_null_mask, col])
            
            encoding_info[col] = {
                'strategy': 'frequency_encoding',
                'frequency_map': freq_map,
                'encoder': le,
                'new_columns': [f"{col}_freq", f"{col}_label"]
            }
            print(f"   ✅ Created: {col}_freq, {col}_label")
            
        elif strategy == "Label + Frequency":
            # High cardinality variables
            # Frequency encoding
            freq_map = df_encoded[col].value_counts().to_dict()
            df_encoded[f"{col}_freq"] = df_encoded[col].map(freq_map)
            
            # Normalized frequency
            total_count = df_encoded[col].value_counts().sum()
            df_encoded[f"{col}_freq_norm"] = df_encoded[f"{col}_freq"] / total_count
            
            # Label encoding
            le = LabelEncoder()
            non_null_mask = df_encoded[col].notna()
            df_encoded.loc[non_null_mask, f"{col}_label"] = le.fit_transform(df_encoded.loc[non_null_mask, col])
            
            encoding_info[col] = {
                'strategy': 'label_and_frequency',
                'frequency_map': freq_map,
                'encoder': le,
                'new_columns': [f"{col}_freq", f"{col}_freq_norm", f"{col}_label"]
            }
            print(f"   ✅ Created: {col}_freq, {col}_freq_norm, {col}_label")
    
    return df_encoded, encoding_info

# Apply encoding to all tables
encoded_dataframes = {}
all_encoding_info = {}

for table_name, df in normalized_dataframes.items():
    analysis = categorical_analysis.get(table_name, {})
    encoded_df, encoding_info = encode_categorical_variables(df, table_name, analysis)
    encoded_dataframes[table_name] = encoded_df
    all_encoding_info[table_name] = encoding_info

print("\n" + "="*60)
print("✅ Categorical encoding completed!")

🎯 CATEGORICAL VARIABLE ENCODING

🔧 Encoding variables for: details_adventure_gear
----------------------------------------

🔍 Encoding: item_id
   📊 Strategy: Label + Frequency
   📊 Unique values: 1
   ✅ Created: item_id_freq, item_id_freq_norm, item_id_label

🔍 Encoding: name
   📊 Strategy: Label + Frequency
   📊 Unique values: 2
   ✅ Created: name_freq, name_freq_norm, name_label

🔍 Encoding: price
   📊 Strategy: Frequency Encoding
   📊 Unique values: 14
   ✅ Created: price_freq, price_label

🔍 Encoding: weight
   📊 Strategy: Frequency Encoding
   📊 Unique values: 13
   ✅ Created: weight_freq, weight_label

🔍 Encoding: category
   📊 Strategy: One-Hot Encoding
   📊 Unique values: 6
   ✅ Created: 6 dummy variables

🔍 Encoding: type
   📊 Strategy: Label Encoding
   📊 Unique values: 1
   ✅ Created: type_label

🔧 Encoding variables for: details_magic_items
----------------------------------------

🔍 Encoding: item_id
   📊 Strategy: Label + Frequency
   📊 Unique values: 1
   ✅ Created: ite
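
The stored `frequency_map` can be reused on data that was not present when the map was built. The sketch below is one way to do that, assuming unseen categories should receive a count of 0; the helper name `apply_frequency_map`, the `unseen_value` parameter, and the sample values are hypothetical.

```python
import pandas as pd

def apply_frequency_map(series, freq_map, unseen_value=0):
    """Map categories to stored counts; categories absent from the map get `unseen_value`."""
    mapped = series.map(freq_map)
    # Fill only categories missing from the map; genuinely missing inputs stay NaN
    unseen = series.notna() & mapped.isna()
    return mapped.mask(unseen, unseen_value)

train = pd.Series(["1 gp", "1 gp", "2 gp", None])
freq_map = train.value_counts().to_dict()  # same construction as in the encoding cell
new = pd.Series(["1 gp", "7 gp", None])
encoded = apply_frequency_map(new, freq_map)
print(encoded.tolist())
```

Keeping missing inputs as NaN matches the encoding cell, where `map` leaves null rows unfilled.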

In [6]:
# Validate and summarize results
print("✅ CATEGORICAL CLEANING VALIDATION")
print("=" * 60)

# Create summary
summary_data = []
total_original_categorical = 0
total_new_encoded = 0

for table_name in dataframes.keys():
    original_categorical = len(dataframes[table_name].select_dtypes(include=['object']).columns)
    final_df = encoded_dataframes[table_name]
    
    # Count the new encoded columns recorded during encoding
    encoding_info = all_encoding_info.get(table_name, {})
    new_encoded = sum(len(info['new_columns']) for info in encoding_info.values())
    
    summary_data.append({
        'Table': table_name,
        'Original_Categorical': original_categorical,
        'New_Encoded_Cols': new_encoded,
        'Encoding_Strategies': len(encoding_info),
        'Final_Rows': len(final_df)
    })
    
    total_original_categorical += original_categorical
    total_new_encoded += new_encoded

summary_df = pd.DataFrame(summary_data)
print("📊 CATEGORICAL CLEANING SUMMARY:")
print(summary_df.to_string(index=False))

# Overall statistics
print(f"\n🎯 OVERALL RESULTS:")
print(f"   📊 Original categorical columns: {total_original_categorical}")
print(f"   📊 New encoded columns created: {total_new_encoded}")
print(f"   📊 Encoding expansion ratio: {total_new_encoded/max(1, total_original_categorical):.2f}x")

# Data integrity check
print(f"\n🔍 DATA INTEGRITY CHECK:")
all_checks_passed = True
for table_name in dataframes.keys():
    original_rows = len(dataframes[table_name])
    final_rows = len(encoded_dataframes[table_name])
    
    if original_rows == final_rows:
        print(f"   ✅ {table_name}: Row count preserved ({final_rows:,} rows)")
    else:
        print(f"   ❌ {table_name}: Row count changed! {original_rows:,} → {final_rows:,}")
        all_checks_passed = False

if all_checks_passed:
    print(f"\n🎉 All data integrity checks passed!")

# Save results
with open('data_intermediate/05_encoded_dataframes.pkl', 'wb') as f:
    pickle.dump(encoded_dataframes, f)
print(f"\n💾 Saved encoded dataframes to data_intermediate/05_encoded_dataframes.pkl")

with open('data_intermediate/05_encoding_info.pkl', 'wb') as f:
    pickle.dump(all_encoding_info, f)
print(f"✅ Saved encoding info to data_intermediate/05_encoding_info.pkl")

print(f"\n🎯 CATEGORICAL CLEANING PHASE COMPLETE!")
print(f"   ➡️ Next: Run 06_numerical_variables_cleaning.ipynb")

✅ CATEGORICAL CLEANING VALIDATION
📊 CATEGORICAL CLEANING SUMMARY:
                 Table  Original_Categorical  New_Encoded_Cols  Encoding_Strategies  Final_Rows
details_adventure_gear                     6                17                    6         106
   details_magic_items                     6                16                    6         199
       details_weapons                     8                19                    8          37
         details_armor                     8                34                    8          13
       details_potions                     5                12                    5          22
       details_poisons                     5                13                    5          16
          all_products                     4                14                    4         393
             customers                     5                11                    5        1423
                 sales                     6                13        
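
Row counts alone do not show whether an encoded column was filled consistently. A complementary check, sketched here with the hypothetical helper `label_nulls_match`, compares the missing-value pattern of a source column with that of its `_label` column, which the encoding cell fills only for non-null rows. The toy frame is illustrative, not real data.

```python
import pandas as pd

def label_nulls_match(df, source_col, label_col):
    """True when the label column is missing exactly where the source column is missing."""
    return bool((df[source_col].isna() == df[label_col].isna()).all())

toy = pd.DataFrame({"type": ["gear", None, "gear"], "type_label": [0.0, None, 0.0]})
print(label_nulls_match(toy, "type", "type_label"))
```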

## 🎉 Phase 5 Complete!

**What we accomplished:**
- ✅ Analyzed categorical variable patterns and cardinality
- ✅ Normalized text data (case and whitespace)
- ✅ Consolidated similar categories and handled rare values
- ✅ Applied encoding strategies chosen by each variable's cardinality
- ✅ Created ML-compatible categorical features

**Encoding Strategies Applied:**
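- **Label Encoding**: For columns with at most 2 unique values
- **One-Hot Encoding**: For low cardinality columns (3 to 10 unique values)
- **Frequency Encoding**: For medium cardinality columns (more than 10 unique values), stored alongside a label encoding
- **Combined Encoding**: For high cardinality columns (unique-to-non-null ratio above 0.8): frequency, normalized frequency, and label encoding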

**Next Steps:**
- Run `06_numerical_variables_cleaning.ipynb` to clean and optimize numerical variables

**Data Files Created:**
- `data_intermediate/05_encoded_dataframes.pkl` - DataFrames with encoded categorical variables
- `data_intermediate/05_encoding_info.pkl` - Detailed encoding strategy information
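
A later phase could read both files back with a small loader such as the one below. This is a sketch only: the function name `load_phase5_outputs` is hypothetical, and the file names are those saved in the validation cell. Pickle files should only be loaded from a trusted source, and scikit-learn presumably needs to be installed because the encoding info stores `LabelEncoder` objects.

```python
import pickle
from pathlib import Path

def load_phase5_outputs(directory="data_intermediate"):
    """Load the encoded dataframes and the encoding info saved at the end of this phase."""
    base = Path(directory)
    with open(base / "05_encoded_dataframes.pkl", "rb") as f:
        encoded_dataframes = pickle.load(f)
    with open(base / "05_encoding_info.pkl", "rb") as f:
        encoding_info = pickle.load(f)
    return encoded_dataframes, encoding_info

# Example call, once the phase has been run and the files exist:
# encoded_dataframes, encoding_info = load_phase5_outputs()
```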