# 🛠️ Adventurer Mart: ML Data Preparation - Part 3

## 🔍 3. Missing Values Handling

This notebook systematically identifies and handles missing values across all tables.

### 🎯 Objectives
- Analyze missing value patterns and causes
- Implement appropriate missing value strategies
- Validate handling effectiveness
- Export cleaned data for next phase

### 🔧 Handling Strategies
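- **Drop**: Remove columns with more than 50% missing values and rows with more than 80% missing values
- **Impute**: Fill remaining gaps with the median (numeric columns) or the mode (categorical columns)
- **Flag**: Create a `<column>_was_missing` indicator for each imputed column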
- **Advanced**: Use domain knowledge for context-specific handling
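
As a rough illustration of the Drop, Flag, and Impute strategies, the sketch below applies them to a small made-up table. The column names, the values, and the 50% column threshold used here are assumptions chosen for the example; none of them come from the Adventurer Mart data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; names and values are illustrative only
toy = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 50.0],
    "rarity": ["common", "rare", None, "common"],
    "notes": [None, None, None, "fragile"],  # 75% missing
})

# Drop: remove columns whose missing share exceeds the assumed 50% threshold
missing_share = toy.isnull().mean()
kept = toy.drop(columns=missing_share[missing_share > 0.5].index)

# Flag: record missingness before imputing, so the information is not lost
for col in ["price", "rarity"]:
    kept[f"{col}_was_missing"] = kept[col].isnull().astype(int)

# Impute: median for the numeric column, mode for the categorical column
kept["price"] = kept["price"].fillna(kept["price"].median())
kept["rarity"] = kept["rarity"].fillna(kept["rarity"].mode()[0])

print(kept)
```

Flagging before imputing matters: once the gaps are filled, the original missingness can no longer be recovered from the column itself.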

In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import warnings
from sklearn.impute import SimpleImputer, KNNImputer

warnings.filterwarnings('ignore')

print("📦 Libraries imported successfully!")
print("🔧 Missing value handling tools ready")

📦 Libraries imported successfully!
🔧 Missing value handling tools ready


In [2]:
# Load data from previous phase
print("📂 LOADING DATA FROM PHASE 2")
print("=" * 50)

try:
    # Load dataframes
    with open('data_intermediate/02_dataframes.pkl', 'rb') as f:
        dataframes = pickle.load(f)
    print("✅ Loaded dataframes from Phase 2")
    
    # Load EDA results
    with open('data_intermediate/02_eda_results.pkl', 'rb') as f:
        eda_results = pickle.load(f)
    print("✅ Loaded EDA analysis results")
    
    print(f"\n📊 Dataset Status:")
    print(f"   • Tables: {len(dataframes)}")
    print(f"   • Total missing values: {eda_results['summary_stats']['total_missing']:,}")
    
except FileNotFoundError as e:
    print(f"❌ Error loading data: {e}")
    print("🔄 Please run previous phases first")
    raise

📂 LOADING DATA FROM PHASE 2
✅ Loaded dataframes from Phase 2
✅ Loaded EDA analysis results

📊 Dataset Status:
   • Tables: 9
   • Total missing values: 474


In [3]:
# Comprehensive missing value analysis
print("🔍 MISSING VALUE PATTERN ANALYSIS")
print("=" * 60)

def analyze_missing_patterns(df, table_name):
    """Analyze missing value patterns in detail"""
    print(f"\n📋 Table: {table_name}")
    print("-" * 40)
    
    missing_counts = df.isnull().sum()
    total_missing = missing_counts.sum()
    
    if total_missing == 0:
        print("✅ No missing values found")
        return None
    
    print(f"🕳️ Total missing values: {total_missing:,}")
    
    # Missing values by column
    cols_with_missing = missing_counts[missing_counts > 0]
    print(f"📊 Columns with missing values: {len(cols_with_missing)}")
    
    for col, count in cols_with_missing.items():
        pct = (count / len(df)) * 100
        print(f"   • {col}: {count:,} ({pct:.1f}%)")
    
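    # Missing value patterns: count rows per combination of missing columns.
    # The all-present combination is counted as a pattern too, so the
    # numbering printed below can start at "Pattern 2".
    missing_patterns = df.isnull().groupby(list(df.columns)).size().sort_values(ascending=False)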
    
    if len(missing_patterns) > 1:
        print(f"\n🔍 Missing patterns found: {len(missing_patterns)}")
        for i, (pattern, count) in enumerate(missing_patterns.head(3).items()):
            missing_cols = [col for col, is_missing in zip(df.columns, pattern) if is_missing]
            if missing_cols:
                print(f"   Pattern {i+1}: {count:,} rows missing {missing_cols}")
    
    return {
        'total_missing': total_missing,
        'columns_with_missing': cols_with_missing.to_dict(),
        'missing_patterns': missing_patterns
    }

# Analyze all tables
missing_analysis = {}
for table_name, df in dataframes.items():
    analysis = analyze_missing_patterns(df, table_name)
    if analysis:
        missing_analysis[table_name] = analysis

print("\n" + "="*60)
print("✅ Missing value analysis completed!")

🔍 MISSING VALUE PATTERN ANALYSIS

📋 Table: details_adventure_gear
----------------------------------------
🕳️ Total missing values: 1
📊 Columns with missing values: 1
   • weight: 1 (0.9%)

🔍 Missing patterns found: 2
   Pattern 2: 1 rows missing ['weight']

📋 Table: details_magic_items
----------------------------------------
✅ No missing values found

📋 Table: details_weapons
----------------------------------------
✅ No missing values found

📋 Table: details_armor
----------------------------------------
🕳️ Total missing values: 17
📊 Columns with missing values: 3
   • ac: 1 (7.7%)
   • requirements: 10 (76.9%)
   • stealth: 6 (46.2%)

🔍 Missing patterns found: 4
   Pattern 1: 5 rows missing ['requirements', 'stealth']
   Pattern 2: 4 rows missing ['requirements']

📋 Table: details_potions
----------------------------------------
✅ No missing values found

📋 Table: details_poisons
----------------------------------------
🕳️ Total missing values: 1
📊 Columns with missing values: 1
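
The pattern count in `analyze_missing_patterns` groups the boolean missingness mask by all of its columns, so each distinct combination of missing columns becomes one group. A minimal sketch on a made-up table (the column names echo `details_armor`, but the values are invented) shows that the fully populated combination is counted as a pattern as well, which would explain a list that starts at "Pattern 2". The final filtering step is a suggestion, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the missingness layout matters here
toy = pd.DataFrame({
    "ac": ["11", "12", "14", "16", np.nan],
    "stealth": [np.nan, "Disadvantage", np.nan, np.nan, "Disadvantage"],
})

mask = toy.isnull()
patterns = mask.groupby(list(mask.columns)).size().sort_values(ascending=False)

# Each index entry is a tuple of booleans, one per column (True = missing)
for flags, count in patterns.items():
    missing_cols = [c for c, is_missing in zip(mask.columns, flags) if is_missing]
    print(count, "rows missing", missing_cols or "nothing")

# Optional: drop the all-present combination to keep only real missingness patterns
only_missing = patterns[[any(flags) for flags in patterns.index]]
```

Note that this tuple-based loop assumes a table with at least two columns; with a single column the group keys are plain booleans and cannot be zipped.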
  

In [4]:
# Implement missing value handling strategies
print("🔧 IMPLEMENTING MISSING VALUE HANDLING")
print("=" * 60)

def handle_missing_values(df, table_name):
    """Handle missing values with appropriate strategies"""
    print(f"\n🔧 Processing: {table_name}")
    print("-" * 40)
    
    df_cleaned = df.copy()
    handling_log = []
    
    # Check if there are missing values
    if df_cleaned.isnull().sum().sum() == 0:
        print("✅ No missing values to handle")
        return df_cleaned, []
    
    original_shape = df_cleaned.shape
    
    # 1. Drop columns with >50% missing values
    missing_pct = (df_cleaned.isnull().sum() / len(df_cleaned)) * 100
    cols_to_drop = missing_pct[missing_pct > 50].index.tolist()
    
    if cols_to_drop:
        print(f"🗑️ Dropping {len(cols_to_drop)} columns with >50% missing")
        df_cleaned = df_cleaned.drop(columns=cols_to_drop)
        handling_log.append(f"Dropped columns: {cols_to_drop}")
    
    # 2. Drop rows with >80% missing values
    missing_threshold = 0.8 * df_cleaned.shape[1]
    rows_missing = df_cleaned.isnull().sum(axis=1)
    rows_to_drop = rows_missing > missing_threshold
    rows_dropped = rows_to_drop.sum()
    
    if rows_dropped > 0:
        print(f"🗑️ Dropping {rows_dropped} rows with excessive missing values")
        df_cleaned = df_cleaned[~rows_to_drop]
        handling_log.append(f"Dropped {rows_dropped} rows")
    
    # 3. Handle remaining missing values by column type
    remaining_missing = df_cleaned.isnull().sum()
    cols_with_missing = remaining_missing[remaining_missing > 0].index.tolist()
    
    for col in cols_with_missing:
        missing_count = df_cleaned[col].isnull().sum()
        # Use a distinct name so the per-column Series `missing_pct` above is not shadowed
        col_missing_pct = (missing_count / len(df_cleaned)) * 100
        
        print(f"\n🔍 Handling '{col}': {missing_count} missing ({col_missing_pct:.1f}%)")
        
        # Create indicator column
        indicator_col = f"{col}_was_missing"
        df_cleaned[indicator_col] = df_cleaned[col].isnull().astype(int)
        
        if pd.api.types.is_numeric_dtype(df_cleaned[col]):
            # Numerical column (any integer or float dtype) - use median
            fill_value = df_cleaned[col].median()
            df_cleaned[col] = df_cleaned[col].fillna(fill_value)
            print(f"   📊 Imputed with median: {fill_value:.2f}")
            handling_log.append(f"Numeric imputation - {col}: median={fill_value:.2f}")
            
        else:
            # Categorical column - use mode
            mode_values = df_cleaned[col].mode()
            if len(mode_values) > 0:
                fill_value = mode_values[0]
                df_cleaned[col] = df_cleaned[col].fillna(fill_value)
                print(f"   📊 Imputed with mode: '{fill_value}'")
                handling_log.append(f"Categorical imputation - {col}: mode='{fill_value}'")
            else:
                df_cleaned[col] = df_cleaned[col].fillna('Unknown')
                print(f"   📊 Imputed with 'Unknown'")
                handling_log.append(f"Categorical imputation - {col}: 'Unknown'")
        
        print(f"   🏷️ Created indicator: {indicator_col}")
    
    final_shape = df_cleaned.shape
    print(f"\n📊 Summary:")
    print(f"   Original: {original_shape}")
    print(f"   Final: {final_shape}")
    print(f"   Rows removed: {original_shape[0] - final_shape[0]}")
    
    return df_cleaned, handling_log

# Apply missing value handling to all tables
cleaned_dataframes = {}
all_handling_logs = {}

for table_name, df in dataframes.items():
    cleaned_df, log = handle_missing_values(df, table_name)
    cleaned_dataframes[table_name] = cleaned_df
    all_handling_logs[table_name] = log

print("\n" + "="*60)
print("✅ Missing value handling completed!")

🔧 IMPLEMENTING MISSING VALUE HANDLING

🔧 Processing: details_adventure_gear
----------------------------------------

🔍 Handling 'weight': 1 missing (0.9%)
   📊 Imputed with mode: '1 lb.'
   🏷️ Created indicator: weight_was_missing

📊 Summary:
   Original: (106, 6)
   Final: (106, 7)
   Rows removed: 0

🔧 Processing: details_magic_items
----------------------------------------
✅ No missing values to handle

🔧 Processing: details_weapons
----------------------------------------
✅ No missing values to handle

🔧 Processing: details_armor
----------------------------------------
🗑️ Dropping 1 columns with >50% missing

🔍 Handling 'ac': 1 missing (7.7%)
   📊 Imputed with mode: '11 + Dex'
   🏷️ Created indicator: ac_was_missing

🔍 Handling 'stealth': 6 missing (46.2%)
   📊 Imputed with mode: 'Disadvantage'
   🏷️ Created indicator: stealth_was_missing

📊 Summary:
   Original: (13, 9)
   Final: (13, 10)
   Rows removed: 0

🔧 Processing: details_potions
----------------------------------------
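
The log above shows `weight` imputed with the mode `'1 lb.'`, which suggests the column is stored as text and therefore takes the categorical branch. If a numeric median is preferred, one option is to parse the values first. The sketch below assumes every weight follows the form `<number> lb.`; the helper `weight_to_pounds` and the sample values are hypothetical, and values in any other form (fractions, for instance) would come out as missing.

```python
import pandas as pd

def weight_to_pounds(series: pd.Series) -> pd.Series:
    """Parse strings such as '1 lb.' into floats; anything else becomes NaN.

    Hypothetical helper: the '<number> lb.' format is an assumption.
    """
    pattern = r"^\s*(\d+(?:\.\d+)?)\s*lb\.?\s*$"
    extracted = series.str.extract(pattern, expand=False)
    return pd.to_numeric(extracted, errors="coerce")

# Invented sample values, including one genuinely missing entry
weights = pd.Series(["1 lb.", "2 lb.", None, "10 lb."])

pounds = weight_to_pounds(weights)

# The flag covers values that were missing or that failed to parse
was_missing = pounds.isnull().astype(int)

# Median of the parsed values fills the gaps
filled = pounds.fillna(pounds.median())
print(filled.tolist())
```

Whether the median of parsed weights is a better fill than the most common label depends on how the column is used downstream, so this is a design choice to weigh and not a correction.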


In [5]:
# Validate results and save outputs
print("✅ MISSING VALUE HANDLING VALIDATION")
print("=" * 60)

# Create before/after comparison
validation_data = []

for table_name in dataframes.keys():
    original = dataframes[table_name]
    cleaned = cleaned_dataframes[table_name]
    
    validation_data.append({
        'Table': table_name,
        'Original_Rows': original.shape[0],
        'Final_Rows': cleaned.shape[0],
        'Original_Cols': original.shape[1],
        'Final_Cols': cleaned.shape[1],
        'Original_Missing': original.isnull().sum().sum(),
        'Final_Missing': cleaned.isnull().sum().sum(),
        'Indicator_Cols': len([col for col in cleaned.columns if col.endswith('_was_missing')])
    })

validation_df = pd.DataFrame(validation_data)
print("📊 VALIDATION SUMMARY:")
print(validation_df.to_string(index=False))

# Final check
total_remaining_missing = sum(df.isnull().sum().sum() for df in cleaned_dataframes.values())
total_indicators = sum(len([col for col in df.columns if col.endswith('_was_missing')]) for df in cleaned_dataframes.values())

print(f"\n🎯 FINAL RESULTS:")
if total_remaining_missing == 0:
    print(f"   ✅ All missing values handled successfully!")
else:
    print(f"   ⚠️ {total_remaining_missing} missing values remain")

print(f"   📊 Created {total_indicators} missing value indicators")

# Save results
with open('data_intermediate/03_cleaned_dataframes.pkl', 'wb') as f:
    pickle.dump(cleaned_dataframes, f)
print(f"\n💾 Saved cleaned dataframes to data_intermediate/03_cleaned_dataframes.pkl")

with open('data_intermediate/03_handling_logs.pkl', 'wb') as f:
    pickle.dump(all_handling_logs, f)
print(f"✅ Saved handling logs to data_intermediate/03_handling_logs.pkl")

print(f"\n🎯 MISSING VALUES PHASE COMPLETE!")
print(f"   ➡️ Next: Run 04_duplicate_handling.ipynb")

✅ MISSING VALUE HANDLING VALIDATION
📊 VALIDATION SUMMARY:
                 Table  Original_Rows  Final_Rows  Original_Cols  Final_Cols  Original_Missing  Final_Missing  Indicator_Cols
details_adventure_gear            106         106              6           7                 1              0               1
   details_magic_items            199         199              6           6                 0              0               0
       details_weapons             37          37              8           8                 0              0               0
         details_armor             13          13              9          10                17              0               2
       details_potions             22          22              5           5                 0              0               0
       details_poisons             16          16              6           7                 1              0               1
          all_products            393         393           
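
The summary compares counts before and after cleaning. A stricter check, sketched below, would confirm that each indicator column flags exactly as many rows as were originally missing in its source column. The helper `check_cleaning` is hypothetical; it assumes the `_was_missing` suffix used in this notebook and that cleaned rows keep their original index labels.

```python
import pandas as pd

SUFFIX = "_was_missing"

def check_cleaning(original: pd.DataFrame, cleaned: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if cleaned.isnull().sum().sum() != 0:
        problems.append("missing values remain")
    for col in cleaned.columns:
        if not col.endswith(SUFFIX):
            continue
        source = col[: -len(SUFFIX)]
        if source not in original.columns:
            continue  # not an indicator created from an original column
        # Compare only the rows that survived cleaning
        expected = int(original.loc[cleaned.index, source].isnull().sum())
        flagged = int(cleaned[col].sum())
        if flagged != expected:
            problems.append(f"{col}: flagged {flagged}, expected {expected}")
    return problems

# Invented example data
original = pd.DataFrame({"stealth": ["Disadvantage", None, None], "cost": [5, 10, 45]})
cleaned = original.copy()
cleaned["stealth_was_missing"] = cleaned["stealth"].isnull().astype(int)
cleaned["stealth"] = cleaned["stealth"].fillna("Disadvantage")

print(check_cleaning(original, cleaned))
```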

## 🎉 Phase 3 Complete!

**What we accomplished:**
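- ✅ Analyzed missing value patterns across all tables
- ✅ Applied threshold-based dropping and median/mode imputation
- ✅ Created a `<column>_was_missing` indicator for every imputed column
- ✅ Confirmed that no missing values remain in the tables shown in the validation summary
- ✅ Kept row counts unchanged in every table shown in the validation summary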

**Handling Strategies Applied:**
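- Dropped columns with >50% missing values (one column in `details_armor`; `requirements`, at 76.9% missing, is the only one above the threshold)
- Dropped rows with >80% of their values missing (no rows met this threshold in the output shown)
- Imputed numerical columns with the median
- Imputed categorical columns with the mode (in the output shown, `weight`, `ac`, and `stealth` all took this branch)
- Created indicator variables that record which values were originally missing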
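
To make the 50% column threshold concrete, the following sketch reproduces the arithmetic for `details_armor` using only the counts reported in the analysis output (13 rows; 1, 10, and 6 missing values). It is a worked check and not part of the notebook's pipeline.

```python
import pandas as pd

# Counts taken from the analysis output for details_armor
n_rows = 13
missing_counts = pd.Series({"ac": 1, "requirements": 10, "stealth": 6})

# Share of missing values per column, in percent
missing_pct = missing_counts / n_rows * 100

# Columns strictly above 50% are dropped; the rest are imputed and flagged
cols_to_drop = missing_pct[missing_pct > 50].index.tolist()
cols_to_impute = missing_pct[missing_pct <= 50].index.tolist()
print(cols_to_drop)  # ['requirements']

# 9 original columns, minus the dropped one, plus one indicator per imputed column
final_cols = 9 - len(cols_to_drop) + len(cols_to_impute)
print(final_cols)  # 10
```

The result agrees with the reported shape change from `(13, 9)` to `(13, 10)`. `stealth`, at 46.2% missing, falls just under the threshold and is imputed, so nearly half of its values in the cleaned table are filled in.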

**Next Steps:**
- Run `04_duplicate_handling.ipynb` to remove duplicate records

**Data Files Created:**
- `data_intermediate/03_cleaned_dataframes.pkl` - DataFrames with handled missing values
- `data_intermediate/03_handling_logs.pkl` - Detailed handling operation logs