<a href="https://colab.research.google.com/github/c3045835Newcastle/2/blob/main/CSC3831_Coursework_P1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSC3831 Coursework Part 1: Data Engineering



In [None]:
# Loading in standard packages for analysis; feel free to add any extra packages you'd like to use here
import random
import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno

# Loading in the corrupted dataset to be used in analysis and imputation
hc_path = 'https://raw.githubusercontent.com/PaoloMissier/CSC3831-2021-22/main/IMPUTATION/TARGET-DATASETS/CORRUPTED/HOUSES/houses_0.1_MAR.csv'
houses_corrupted = pd.read_csv(hc_path, header=0)

# Remove an artifact from the dataset
houses_corrupted.drop(["Unnamed: 0"], axis=1, inplace=True)

Above we've loaded in a corrupted version of a housing dataset. The anomalies need to be dealt with and missing values imputed.

### 1. Data Understanding [7]
- Perform ad hoc EDA to understand and describe what you see in the raw dataset
  - Include graphs, statistics, and written descriptions as appropriate
  - Any extra information about the data you can provide here is useful. Think about performing an analysis (ED**A**): what would you find interesting or useful?
- Identify features with missing records, outlier records


In [None]:
# Data Understanding - Exploratory Data Analysis

# First, let's examine the basic structure of the dataset
print("Dataset Shape:", houses_corrupted.shape)
print("\nColumn Names and Types:")
print(houses_corrupted.dtypes)
print("\nFirst few rows:")
houses_corrupted.head(10)

In [None]:
# Get summary statistics for numerical features
print("Summary Statistics:")
houses_corrupted.describe()

In [None]:
# Check for missing values
print("Missing Values Count:")
missing_counts = houses_corrupted.isnull().sum()
missing_percentage = (missing_counts / len(houses_corrupted)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False))

In [None]:
# Visualize missing data patterns using missingno
import matplotlib.pyplot as plt

# Matrix visualization of missing data
msno.matrix(houses_corrupted, figsize=(12, 6))
plt.title('Missing Data Pattern Matrix')
plt.show()

# Bar chart of missing data
msno.bar(houses_corrupted, figsize=(12, 6))
plt.title('Missing Data Counts by Feature')
plt.show()

In [None]:
# Visualize distributions of numerical features
import matplotlib.pyplot as plt

# Select numerical columns only
numerical_cols = houses_corrupted.select_dtypes(include=[np.number]).columns

# Create histograms for all numerical features
fig, axes = plt.subplots(len(numerical_cols)//3 + 1, 3, figsize=(15, 12))
axes = axes.flatten()

for idx, col in enumerate(numerical_cols):
    axes[idx].hist(houses_corrupted[col].dropna(), bins=30, edgecolor='black')
    axes[idx].set_title(f'Distribution of {col}')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')

# Hide unused subplots
for idx in range(len(numerical_cols), len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.show()

In [None]:
# Create box plots to identify potential outliers visually
fig, axes = plt.subplots(len(numerical_cols)//3 + 1, 3, figsize=(15, 12))
axes = axes.flatten()

for idx, col in enumerate(numerical_cols):
    axes[idx].boxplot(houses_corrupted[col].dropna())
    axes[idx].set_title(f'Box Plot of {col}')
    axes[idx].set_ylabel(col)

# Hide unused subplots
for idx in range(len(numerical_cols), len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.show()

In [None]:
# Examine correlations between numerical features
import seaborn as sns

# Calculate correlation matrix (pandas drops missing values pairwise, per column pair)
correlation_matrix = houses_corrupted[numerical_cols].corr()

# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, linewidths=1)
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()

### Data Understanding Summary

From the exploratory data analysis above, we can observe:

**Missing Data:**
- The dataset contains missing values across multiple features
- The missingno visualizations show the pattern and extent of missingness
- Features with high percentages of missing data may need to be removed rather than imputed

**Distributions:**
- The histograms reveal the shape of each feature's distribution
- Some features may show skewness or multiple modes
- Understanding these distributions helps in choosing appropriate imputation methods

**Outliers:**
- Box plots visually identify potential outliers as points beyond the whiskers
- Some features may contain extreme values that could be legitimate or erroneous
- Outliers will be analyzed more rigorously in the next section

**Relationships:**
- The correlation heatmap shows linear relationships between features
- Strong correlations can inform imputation strategies
- Understanding feature relationships is crucial for predictive modeling
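
As a rough way to put a number on the skewness mentioned above, the sketch below computes sample skewness per column with `DataFrame.skew()`. The frame `demo` and its column names are made up for illustration and are not columns from the housing data. The `|skew| > 1` cut-off is a common rule of thumb, assumed here rather than taken from the coursework.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in data: one roughly symmetric and one right-skewed column
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=5, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

# DataFrame.skew() skips NaN by default, so it also works on columns with gaps
skewness = demo.skew()
print(skewness.round(2))

# Rule of thumb (an assumption, not a hard rule): |skew| > 1 is strongly skewed
strongly_skewed = skewness[skewness.abs() > 1].index.tolist()
print("Strongly skewed columns:", strongly_skewed)
```

Running the same two lines on `houses_corrupted[numerical_cols]` would give a per-feature figure to cite alongside the histograms.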

### 2. Outlier Identification [10]
- Utilise a statistical outlier detection approach (i.e., **no** KNN, LOF, One-Class SVM)
- Utilise an algorithmic outlier detection method of your choice
- Compare results and decide what to do with identified outliers
  - Include graphs, statistics, and written descriptions as appropriate
- Explain what you are doing, and why your analysis is appropriate
- Comment on benefits/detriments of statistical and algorithmic outlier detection approaches


In [None]:
# Section 2: Outlier Identification

# Statistical Outlier Detection using IQR (Interquartile Range) method
# This is a robust statistical method that doesn't assume normality

def detect_outliers_iqr(df, columns):
    """
    Detect outliers using the IQR method.
    Points beyond 1.5 * IQR from Q1 or Q3 are considered outliers.
    """
    outlier_indices = set()
    outlier_details = {}
    
    for col in columns:
        # Remove NaN values for calculation
        data = df[col].dropna()
        
        # Calculate Q1, Q3, and IQR
        Q1 = data.quantile(0.25)
        Q3 = data.quantile(0.75)
        IQR = Q3 - Q1
        
        # Define outlier bounds
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Find outliers
        col_outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index
        outlier_indices.update(col_outliers)
        
        outlier_details[col] = {
            'lower_bound': lower_bound,
            'upper_bound': upper_bound,
            'count': len(col_outliers),
            'percentage': (len(col_outliers) / len(data)) * 100
        }
    
    return outlier_indices, outlier_details

# Apply IQR method
numerical_cols = houses_corrupted.select_dtypes(include=[np.number]).columns.tolist()
iqr_outliers, iqr_details = detect_outliers_iqr(houses_corrupted, numerical_cols)

print("Statistical Outlier Detection (IQR Method)")
print("="*60)
print(f"Total rows with outliers: {len(iqr_outliers)}")
print(f"Percentage of dataset: {(len(iqr_outliers) / len(houses_corrupted)) * 100:.2f}%")
print("\nOutliers per feature:")
for col, details in iqr_details.items():
    if details['count'] > 0:
        print(f"  {col}: {details['count']} outliers ({details['percentage']:.2f}%)")
        print(f"    Bounds: [{details['lower_bound']:.2f}, {details['upper_bound']:.2f}]")

In [None]:
# Algorithmic Outlier Detection using Isolation Forest
# This is an unsupervised machine learning method that works well for high-dimensional data

from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Prepare data for Isolation Forest (remove rows with NaN for this analysis)
data_for_if = houses_corrupted[numerical_cols].dropna()

# Standardize the features (optional: Isolation Forest splits on one feature at a time, so it is largely insensitive to feature scale)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_for_if)

# Initialize and fit Isolation Forest
# contamination parameter: expected proportion of outliers (0.1, i.e. 10%, is an assumed value here; the library default is 'auto')
iso_forest = IsolationForest(contamination=0.1, random_state=42, n_estimators=100)
outlier_predictions = iso_forest.fit_predict(scaled_data)

# -1 indicates outlier, 1 indicates inlier
if_outlier_indices = data_for_if.index[outlier_predictions == -1]

print("Algorithmic Outlier Detection (Isolation Forest)")
print("="*60)
print(f"Total outliers detected: {len(if_outlier_indices)}")
print(f"Percentage of complete cases: {(len(if_outlier_indices) / len(data_for_if)) * 100:.2f}%")
print(f"Percentage of full dataset: {(len(if_outlier_indices) / len(houses_corrupted)) * 100:.2f}%")

In [None]:
# Compare the two methods
print("Comparison of Outlier Detection Methods")
print("="*60)

# Find common outliers
common_outliers = set(iqr_outliers).intersection(set(if_outlier_indices))
print(f"Outliers found by IQR only: {len(set(iqr_outliers) - set(if_outlier_indices))}")
print(f"Outliers found by Isolation Forest only: {len(set(if_outlier_indices) - set(iqr_outliers))}")
print(f"Outliers found by both methods: {len(common_outliers)}")

# Visualize the overlap with a Venn diagram (falls back to a bar chart)
import matplotlib.pyplot as plt

try:
    # Optional dependency: imported inside the try so a missing package triggers the fallback
    from matplotlib_venn import venn2

    plt.figure(figsize=(8, 6))
    venn2([set(iqr_outliers), set(if_outlier_indices)], 
          set_labels=('IQR Method', 'Isolation Forest'))
    plt.title('Outlier Detection Method Comparison')
    plt.show()
except ImportError:
    # If matplotlib_venn is not available, create a bar chart instead
    categories = ['IQR Only', 'Both Methods', 'ISO Forest Only']
    counts = [
        len(set(iqr_outliers) - set(if_outlier_indices)),
        len(common_outliers),
        len(set(if_outlier_indices) - set(iqr_outliers))
    ]
    
    plt.figure(figsize=(10, 6))
    plt.bar(categories, counts, color=['blue', 'purple', 'red'])
    plt.title('Outlier Detection Method Comparison')
    plt.ylabel('Number of Outliers')
    plt.xlabel('Detection Category')
    plt.show()

In [None]:
# Analyze outliers detected by both methods (most confident)
if len(common_outliers) > 0:
    print("Sample of records identified as outliers by BOTH methods:")
    print(houses_corrupted.loc[list(common_outliers)[:5]])
    
    # Statistical summary of outliers vs non-outliers
    print("\nComparison: Outliers vs Normal Records")
    print("="*60)
    
    is_outlier = houses_corrupted.index.isin(common_outliers)
    
    for col in numerical_cols[:3]:  # Show first 3 features
        outlier_mean = houses_corrupted.loc[is_outlier, col].mean()
        normal_mean = houses_corrupted.loc[~is_outlier, col].mean()
        outlier_std = houses_corrupted.loc[is_outlier, col].std()
        normal_std = houses_corrupted.loc[~is_outlier, col].std()
        
        print(f"\n{col}:")
        print(f"  Outliers - Mean: {outlier_mean:.2f}, Std: {outlier_std:.2f}")
        print(f"  Normal   - Mean: {normal_mean:.2f}, Std: {normal_std:.2f}")

In [None]:
# Decision on outlier treatment
print("Outlier Treatment Decision")
print("="*60)
print("\nStrategy:")
print("1. Keep all outliers in the dataset for now")
print("   Rationale: In housing data, extreme values may be legitimate")
print("   (e.g., luxury homes, unique properties)")
print("\n2. Will evaluate impact during imputation and modeling phases")
print("   - Compare model performance with and without outliers")
print("   - Use robust imputation methods that handle outliers well")
print("\n3. Flag outliers for reference in subsequent analysis")

# Create a column to flag outliers for future reference
houses_corrupted['is_outlier_iqr'] = houses_corrupted.index.isin(iqr_outliers)
houses_corrupted['is_outlier_if'] = houses_corrupted.index.isin(if_outlier_indices)
houses_corrupted['is_outlier_both'] = houses_corrupted.index.isin(common_outliers)

print(f"\nOutlier flags added to dataset:")
print(f"  - is_outlier_iqr: {houses_corrupted['is_outlier_iqr'].sum()} records")
print(f"  - is_outlier_if: {houses_corrupted['is_outlier_if'].sum()} records")
print(f"  - is_outlier_both: {houses_corrupted['is_outlier_both'].sum()} records")

### Analysis and Commentary on Outlier Detection

**Statistical Method (IQR):**

*Benefits:*
- Simple and interpretable
- Robust to extreme values (uses quartiles)
- Works well for univariate analysis
- No assumptions about data distribution
- Easy to explain and understand

*Detriments:*
- Analyzes each feature independently
- May miss multivariate outliers
- Fixed threshold (1.5 * IQR) may not suit all contexts
- Can be overly sensitive in skewed distributions
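
To illustrate the last point, here is a minimal sketch that applies the same 1.5 * IQR fences to a symmetric sample and to a right-skewed one. Both samples are synthetic and the helper `iqr_flag_fraction` is hypothetical, written only for this comparison.

```python
import numpy as np

def iqr_flag_fraction(values, k=1.5):
    """Return the fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return outside.mean()

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=10_000)      # symmetric
skewed_sample = rng.lognormal(size=10_000)   # long right tail

normal_frac = iqr_flag_fraction(normal_sample)
skewed_frac = iqr_flag_fraction(skewed_sample)
print(f"Flagged in normal sample:    {normal_frac:.2%}")
print(f"Flagged in lognormal sample: {skewed_frac:.2%}")
```

The skewed sample should show a clearly larger flagged share even though nothing in it is erroneous, which is why a high IQR count on a skewed housing feature is not by itself evidence of bad records.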

**Algorithmic Method (Isolation Forest):**

*Benefits:*
- Considers multiple features simultaneously
- Detects complex, multivariate outliers
- Scalable to high-dimensional data
- No need to define distance metrics
- Works well without assuming data distribution

*Detriments:*
- Less interpretable ("black box" method)
- Requires complete cases (no missing values)
- Contamination parameter must be specified
- Results can vary with random initialization
- More computationally intensive
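
The effect of the contamination parameter can be sketched on synthetic data. The array `X` below is a single Gaussian blob with no planted anomalies (an assumption made for illustration), yet the share of flagged rows should track whatever `contamination` is set to.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical data: 1,000 points from one Gaussian blob, no planted anomalies
X = rng.normal(size=(1000, 3))

flagged_fraction = {}
for contamination in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=contamination, random_state=42)
    labels = model.fit_predict(X)          # -1 = outlier, 1 = inlier
    flagged_fraction[contamination] = float((labels == -1).mean())

for contamination, fraction in flagged_fraction.items():
    print(f"contamination={contamination:.2f} -> flagged {fraction:.1%}")
```

In other words, the count of Isolation Forest outliers is largely a consequence of the chosen threshold; the ranking of records by anomaly score carries the actual information.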

**Decision Rationale:**

For this housing dataset, we chose to retain outliers because:
1. Real estate naturally has high variability (luxury vs. standard homes)
2. "Outliers" may represent legitimate market segments
3. Robust imputation methods (like MICE) can handle outliers
4. We can evaluate model performance with/without outliers later

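The IQR method identified more outliers, which makes it the more aggressive (less conservative) of the two here. The comparison is not like-for-like, though: Isolation Forest flags roughly the share set by `contamination=0.1` and only scores complete cases, so rows with missing values can appear only in the IQR set. The Isolation Forest found fewer but potentially more meaningful multivariate outliers. Records flagged by both methods warrant the closest examination.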
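
The idea of a multivariate outlier that per-feature fences cannot see can be sketched on synthetic data. Note that the multivariate check below uses the Mahalanobis distance as a simple, deterministic stand-in; it is not Isolation Forest and not the method used in this notebook. The data, the planted record `odd_point`, and the helper `outside_iqr_fence` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two strongly correlated features
x = rng.normal(size=2000)
y = x + rng.normal(scale=0.1, size=2000)
data = np.column_stack([x, y])

# Planted record: each coordinate is ordinary on its own, the combination is not
odd_point = np.array([1.5, -1.5])
data = np.vstack([data, odd_point])

def outside_iqr_fence(column, value, k=1.5):
    """True if value lies outside the 1.5 * IQR fences of the column."""
    q1, q3 = np.percentile(column, [25, 75])
    iqr = q3 - q1
    return bool(value < q1 - k * iqr or value > q3 + k * iqr)

flagged_by_iqr = any(outside_iqr_fence(data[:, j], odd_point[j]) for j in range(2))

# Squared Mahalanobis distance accounts for the correlation between features
mean = data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
centered = data - mean
d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
rows_more_extreme = int((d2 > d2[-1]).sum())  # 0 means the planted row is the most extreme

print("Flagged by per-feature IQR:", flagged_by_iqr)
print("Rows more extreme by Mahalanobis distance:", rows_more_extreme)
```

The planted record passes both univariate checks but stands out once the two features are considered together, which is the kind of case a multivariate method is meant to catch.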

### 3. Imputation [10]
- Identify which features should be imputed and which should be removed
  - Provide a written rationale for this decision
- Impute the missing records using KNN imputation
- Impute the missing records using MICE imputation
- Compare both imputed datasets feature distributions against each other and the non-imputed data
- Build a regressor on all three datasets
  - Use regression models to predict house median price
  - Compare regressors of non-imputed data against imputed data
  - **Note**: If you're struggling to compare against the original dataset focus on comparing the two imputed datasets against each other


In [None]:
# Use this dataset for comparison against the imputed datasets
h_path = 'https://raw.githubusercontent.com/PaoloMissier/CSC3831-2021-22/main/IMPUTATION/TARGET-DATASETS/ORIGINAL/houses.csv'

In [None]:
# Load the original dataset for comparison
houses_original = pd.read_csv(h_path, header=0)

In [None]:
# Section 3: Imputation
# Step 1: Decide which features to impute vs remove

print("Feature Retention Decision")
print("="*60)

# Check missing percentages again
missing_info = pd.DataFrame({
    'Missing_Count': houses_corrupted.isnull().sum(),
    'Missing_Percentage': (houses_corrupted.isnull().sum() / len(houses_corrupted)) * 100
})
missing_info = missing_info[missing_info['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print("\nFeatures with missing values:")
print(missing_info)

# Decision criteria:
# - Remove features with >50% missing data (too sparse to impute reliably)
# - Impute features with <=50% missing data

threshold = 50  # percentage
features_to_remove = missing_info[missing_info['Missing_Percentage'] > threshold].index.tolist()
features_to_impute = missing_info[missing_info['Missing_Percentage'] <= threshold].index.tolist()

# Exclude the outlier flag columns we added
outlier_flags = ['is_outlier_iqr', 'is_outlier_if', 'is_outlier_both']
features_to_remove = [f for f in features_to_remove if f not in outlier_flags]
features_to_impute = [f for f in features_to_impute if f not in outlier_flags]

print(f"\nDecision (threshold: {threshold}% missing):")
print(f"  Features to REMOVE ({len(features_to_remove)}): {features_to_remove}")
print(f"  Features to IMPUTE ({len(features_to_impute)}): {features_to_impute}")

print("\nRationale:")
print("  - Features with >50% missing values lack sufficient information")
print("    for reliable imputation and may introduce bias")
print("  - Features with <=50% missing can be imputed using KNN or MICE")
print("  - This threshold balances data retention with imputation quality")

In [None]:
# Prepare dataset for imputation
# Remove outlier flag columns and features with too much missing data
houses_for_imputation = houses_corrupted.drop(columns=outlier_flags + features_to_remove, errors='ignore')

print(f"Dataset prepared for imputation:")
print(f"  Original shape: {houses_corrupted.shape}")
print(f"  Prepared shape: {houses_for_imputation.shape}")
print(f"  Features removed: {len(features_to_remove)}")

In [None]:
# KNN Imputation
from sklearn.impute import KNNImputer
import pandas as pd

print("KNN Imputation")
print("="*60)

# Initialize KNN Imputer with k=5 neighbors
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')

# Separate numerical columns for imputation
numerical_cols_impute = houses_for_imputation.select_dtypes(include=[np.number]).columns.tolist()
non_numerical_cols = houses_for_imputation.select_dtypes(exclude=[np.number]).columns.tolist()

# Apply KNN imputation to numerical columns
houses_knn_imputed_values = knn_imputer.fit_transform(houses_for_imputation[numerical_cols_impute])

# Create new dataframe with imputed values
houses_knn = pd.DataFrame(
    houses_knn_imputed_values,
    columns=numerical_cols_impute,
    index=houses_for_imputation.index
)

# Add back non-numerical columns if any
for col in non_numerical_cols:
    houses_knn[col] = houses_for_imputation[col]

print(f"KNN Imputation completed")
print(f"  Missing values remaining: {houses_knn.isnull().sum().sum()}")
print(f"  Shape: {houses_knn.shape}")

In [None]:
# MICE (Multiple Imputation by Chained Equations) Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

print("MICE Imputation")
print("="*60)

# Initialize MICE imputer
mice_imputer = IterativeImputer(max_iter=10, random_state=42, verbose=0)

# Apply MICE imputation to numerical columns
houses_mice_imputed_values = mice_imputer.fit_transform(houses_for_imputation[numerical_cols_impute])

# Create new dataframe with imputed values
houses_mice = pd.DataFrame(
    houses_mice_imputed_values,
    columns=numerical_cols_impute,
    index=houses_for_imputation.index
)

# Add back non-numerical columns if any
for col in non_numerical_cols:
    houses_mice[col] = houses_for_imputation[col]

print(f"MICE Imputation completed")
print(f"  Missing values remaining: {houses_mice.isnull().sum().sum()}")
print(f"  Shape: {houses_mice.shape}")

In [None]:
# Compare feature distributions
import matplotlib.pyplot as plt

print("Comparing Feature Distributions")
print("="*60)

# Select a few key features to compare
features_to_compare = numerical_cols_impute[:4]  # Compare first 4 numerical features

fig, axes = plt.subplots(len(features_to_compare), 3, figsize=(15, 12))

for idx, feature in enumerate(features_to_compare):
    # Original data (with missing values)
    axes[idx, 0].hist(houses_for_imputation[feature].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[idx, 0].set_title(f'{feature} - Original (with NaN)')
    axes[idx, 0].set_xlabel('Value')
    axes[idx, 0].set_ylabel('Frequency')
    
    # KNN imputed
    axes[idx, 1].hist(houses_knn[feature], bins=30, edgecolor='black', alpha=0.7, color='green')
    axes[idx, 1].set_title(f'{feature} - KNN Imputed')
    axes[idx, 1].set_xlabel('Value')
    axes[idx, 1].set_ylabel('Frequency')
    
    # MICE imputed
    axes[idx, 2].hist(houses_mice[feature], bins=30, edgecolor='black', alpha=0.7, color='orange')
    axes[idx, 2].set_title(f'{feature} - MICE Imputed')
    axes[idx, 2].set_xlabel('Value')
    axes[idx, 2].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Statistical comparison
print("\nStatistical Comparison of Distributions:")
for feature in features_to_compare:
    orig_mean = houses_for_imputation[feature].mean()
    orig_std = houses_for_imputation[feature].std()
    knn_mean = houses_knn[feature].mean()
    knn_std = houses_knn[feature].std()
    mice_mean = houses_mice[feature].mean()
    mice_std = houses_mice[feature].std()
    
    print(f"\n{feature}:")
    print(f"  Original  - Mean: {orig_mean:.2f}, Std: {orig_std:.2f}")
    print(f"  KNN       - Mean: {knn_mean:.2f}, Std: {knn_std:.2f} (Δ: {abs(knn_mean-orig_mean):.2f})")
    print(f"  MICE      - Mean: {mice_mean:.2f}, Std: {mice_std:.2f} (Δ: {abs(mice_mean-orig_mean):.2f})")

In [None]:
# Build regression models to predict house prices
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

print("Building Regression Models")
print("="*60)

# Identify the target variable (assuming it's related to price/value)
# Common names: 'median_house_value', 'price', 'SalePrice', 'value', 'MedHouseVal'
# Let's find it:
possible_targets = ['median_house_value', 'price', 'SalePrice', 'value', 'MedHouseVal']
target_col = None
for col in possible_targets:
    if col in houses_knn.columns:
        target_col = col
        break

if target_col is None:
    # Use the last numerical column as target if no standard name found
    target_col = numerical_cols_impute[-1]
    print(f"Using '{target_col}' as target variable")
else:
    print(f"Target variable identified: '{target_col}'")

# Prepare feature sets (remove target from features)
feature_cols = [col for col in numerical_cols_impute if col != target_col]

def evaluate_model(X, y, dataset_name):
    """Train and evaluate a regression model"""
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train Linear Regression
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)
    lr_pred = lr_model.predict(X_test)
    
    # Train Random Forest
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
    rf_model.fit(X_train, y_train)
    rf_pred = rf_model.predict(X_test)
    
    # Calculate metrics
    results = {
        'dataset': dataset_name,
        'lr_rmse': np.sqrt(mean_squared_error(y_test, lr_pred)),
        'lr_mae': mean_absolute_error(y_test, lr_pred),
        'lr_r2': r2_score(y_test, lr_pred),
        'rf_rmse': np.sqrt(mean_squared_error(y_test, rf_pred)),
        'rf_mae': mean_absolute_error(y_test, rf_pred),
        'rf_r2': r2_score(y_test, rf_pred)
    }
    
    return results

# Evaluate on KNN imputed data
X_knn = houses_knn[feature_cols]
y_knn = houses_knn[target_col]
results_knn = evaluate_model(X_knn, y_knn, 'KNN Imputed')

# Evaluate on MICE imputed data
X_mice = houses_mice[feature_cols]
y_mice = houses_mice[target_col]
results_mice = evaluate_model(X_mice, y_mice, 'MICE Imputed')

# Evaluate on original data (complete cases only)
houses_complete = houses_for_imputation[feature_cols + [target_col]].dropna()
if len(houses_complete) > 100:  # Only if we have enough complete cases
    X_orig = houses_complete[feature_cols]
    y_orig = houses_complete[target_col]
    results_orig = evaluate_model(X_orig, y_orig, 'Original (complete cases)')
    all_results = [results_orig, results_knn, results_mice]
else:
    print("  Too few complete cases in original data for reliable comparison")
    all_results = [results_knn, results_mice]

# Display results
print("\nRegression Model Performance Comparison:")
print("="*60)
results_df = pd.DataFrame(all_results)
print(results_df.to_string(index=False))

In [None]:
# Visualize model performance comparison
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# RMSE comparison
datasets = results_df['dataset'].tolist()
lr_rmse = results_df['lr_rmse'].tolist()
rf_rmse = results_df['rf_rmse'].tolist()

x = np.arange(len(datasets))
width = 0.35

axes[0].bar(x - width/2, lr_rmse, width, label='Linear Regression', color='blue')
axes[0].bar(x + width/2, rf_rmse, width, label='Random Forest', color='green')
axes[0].set_ylabel('RMSE')
axes[0].set_title('Root Mean Squared Error')
axes[0].set_xticks(x)
axes[0].set_xticklabels(datasets, rotation=15, ha='right')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# MAE comparison
lr_mae = results_df['lr_mae'].tolist()
rf_mae = results_df['rf_mae'].tolist()

axes[1].bar(x - width/2, lr_mae, width, label='Linear Regression', color='blue')
axes[1].bar(x + width/2, rf_mae, width, label='Random Forest', color='green')
axes[1].set_ylabel('MAE')
axes[1].set_title('Mean Absolute Error')
axes[1].set_xticks(x)
axes[1].set_xticklabels(datasets, rotation=15, ha='right')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

# R² comparison
lr_r2 = results_df['lr_r2'].tolist()
rf_r2 = results_df['rf_r2'].tolist()

axes[2].bar(x - width/2, lr_r2, width, label='Linear Regression', color='blue')
axes[2].bar(x + width/2, rf_r2, width, label='Random Forest', color='green')
axes[2].set_ylabel('R² Score')
axes[2].set_title('R² (Coefficient of Determination)')
axes[2].set_xticks(x)
axes[2].set_xticklabels(datasets, rotation=15, ha='right')
axes[2].legend()
axes[2].grid(axis='y', alpha=0.3)
axes[2].set_ylim([0, 1])

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("  - Lower RMSE and MAE indicate better predictions")
print("  - Higher R² indicates better model fit (closer to 1.0 is better)")
print("  - Compare imputation methods to see which preserves predictive power")

### Imputation Analysis Summary

**Feature Selection Decision:**
- Removed features with >50% missing data to avoid unreliable imputations
- Retained features with ≤50% missing data for KNN and MICE imputation
- This threshold balances information retention with imputation quality

**KNN Imputation:**
- Uses k-nearest neighbors (k=5) to impute missing values
- Finds similar records based on available features
- Simple, intuitive, and preserves local patterns
- May not capture complex feature relationships

**MICE Imputation:**
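- Multiple Imputation by Chained Equations; scikit-learn's `IterativeImputer`, as configured here, returns a single completed dataset rather than several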
- Iteratively models each feature based on others
- Captures complex relationships between features
- More sophisticated but computationally intensive

**Distribution Comparison:**
- Both methods generally preserve the original distribution shape
- KNN fills each gap with the mean of its 5 nearest neighbors, which can pull imputed values toward local averages
- MICE typically produces smoother, more continuous distributions
- Check if means and standard deviations remain similar to original

**Model Performance:**
- Random Forest generally outperforms Linear Regression
- Imputed datasets allow using full dataset vs. complete cases only
- Compare R² scores to assess which imputation method better preserves predictive relationships
- Lower error metrics (RMSE, MAE) suggest, but do not prove, better imputation quality, since each model is trained and tested on a different version of the data

**Key Insights:**
- Both imputation methods enable using the full dataset for modeling
- The "best" method depends on whether it preserves predictive power
- MICE often performs better when features have complex interdependencies
- KNN may be preferable for simpler, more interpretable results
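
To see what `KNNImputer` actually writes into a missing cell, here is a toy example. The array is made up for illustration, and `n_neighbors=2` is used instead of the notebook's `k=5` only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy array (made up for illustration): the second feature is missing in row 2
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [100.0, 1000.0],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

# The nearest rows on the observed feature are [2, 20] and [4, 40],
# so the gap is filled with their mean rather than a copy of either value
imputed_value = X_filled[2, 1]
print(imputed_value)  # 30.0
```

With uniform weights the imputed value is an average of neighbors, so KNN-imputed columns tend to gain mass near local means rather than at the extremes.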

### 4. Conclusions & Thoughts [3]

**Anomaly Detection Methods:**

*IQR (Statistical Method):*

**Pros:**
- Simple and highly interpretable - easy to explain to stakeholders
- Robust to distribution assumptions (doesn't require normality)
- Fast computation, suitable for large datasets
- Provides clear, quantifiable boundaries for outliers
- Each feature analyzed independently, making it easy to trace outliers

**Cons:**
- Univariate approach misses multivariate outliers (combinations of normal values that are unusual together)
- Fixed 1.5 * IQR threshold may not be appropriate for all domains
- Can be overly sensitive in heavily skewed distributions
- Treats each feature equally without considering importance

*Isolation Forest (Algorithmic Method):*

**Pros:**
- Detects complex, multivariate outliers that statistical methods miss
- No assumptions about data distribution required
- Scales well to high-dimensional data
- Efficient algorithm based on tree structures
- Considers feature interactions automatically

**Cons:**
- "Black box" nature makes results harder to interpret
- Requires complete cases (cannot handle missing values directly)
- Contamination parameter requires domain knowledge or experimentation
- Results can vary with random seed (though we used random_state=42 for reproducibility)
- More computationally intensive than statistical methods

**Challenges in Anomaly Detection Implementation:**

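1. **Missing Data Conflict**: Isolation Forest requires complete cases, but our dataset has missing values. The options were to impute first (introducing bias) or to work with reduced data; we ran it on complete cases only (`dropna()`), so rows with any missing value were never scored.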

2. **Domain Knowledge Gap**: Determining what constitutes a "true" outlier in housing data is challenging. A $10M mansion might be an outlier statistically but is legitimate data.

3. **Method Agreement**: The two methods identified different sets of outliers. Deciding which to trust or how to reconcile them required judgment.

4. **Threshold Selection**: For IQR, the 1.5 multiplier is convention but arbitrary. For Isolation Forest, choosing contamination=0.1 was informed but still somewhat arbitrary.

5. **Action Decision**: Once outliers are identified, deciding whether to remove, transform, or keep them is difficult. Removal risks losing valuable information; keeping them risks model bias.

**Imputation Methods:**

*KNN Imputation:*

**Pros:**
- Intuitive concept: use similar records to fill gaps
- Preserves local patterns and relationships in data
- Non-parametric (no distribution assumptions)
- Works well when similar records exist in dataset
- Relatively simple to implement and explain

**Cons:**
- Computationally expensive for large datasets (distance calculations)
- Sensitive to choice of k (number of neighbors)
- Imputed values are averages of neighbor values (with `weights='uniform'`), which can pull them toward local means and understate variance
- Struggles with high-dimensional data (curse of dimensionality)
- Doesn't account for uncertainty in imputed values

*MICE (Multiple Imputation by Chained Equations):*

**Pros:**
- Sophisticated approach that models feature relationships
- Captures complex, multivariate dependencies
- Produces more realistic, continuous imputed values
- Iterative refinement improves imputation quality
- Grounded in the multiple-imputation framework, although the single-imputation setup used here does not propagate imputation uncertainty

**Cons:**
- More complex to understand and explain
- Computationally intensive (multiple iterations)
- Can propagate modeling errors through iterations
- Requires careful choice of iteration count
- May overfit to training data patterns
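
As a hedged sketch of the difference between a single fill and multiple imputations in scikit-learn, the example below runs `IterativeImputer` once with its defaults and then several times with `sample_posterior=True` and different seeds. The data are synthetic, and the choice of 5 draws is an arbitrary assumption.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Hypothetical data: second column is a noisy linear function of the first
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
X = np.column_stack([x, y])
X[:20, 1] = np.nan  # knock out 20 values in the second column

# Default behaviour: one deterministic fill per missing cell (single imputation)
single = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

# sample_posterior=True draws from the predictive distribution, so different
# seeds give different completed datasets
draws = [
    IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

# Spread of the 5 draws for each originally missing cell
between_draw_std = np.std([d[:20, 1] for d in draws], axis=0)
print("Mean between-draw std of imputed cells:", between_draw_std.mean().round(3))
```

The between-draw spread is a simple indication of how uncertain each imputed value is, something a single completed dataset cannot show.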

**Challenges in Imputation Implementation:**

1. **Feature Selection Dilemma**: Deciding which features to drop versus impute involved trade-offs. The 50% threshold was reasonable but still somewhat arbitrary.

2. **Evaluation Difficulty**: Assessing imputation quality is hard without comparing against ground truth. Although the original dataset was loaded (`houses_original`), we did not compare imputed values against it directly and relied on proxy measures (distribution preservation, model performance), which are imperfect.

3. **Missing Data Mechanism**: We assumed MAR (Missing At Random) as stated in the dataset name, but if data is MNAR (Missing Not At Random), our imputations could be biased.

4. **Computational Resources**: Both methods, especially MICE with multiple iterations, required significant computation time on larger datasets.

5. **Validation Challenge**: How do we know if imputed values are "good"? We used model performance as a proxy, but this is indirect validation.

6. **Outlier-Imputation Interaction**: Should we remove outliers before or after imputation? Outliers can influence imputation, but imputation can also create or remove outliers.
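
When the uncorrupted values are available, imputation error can be measured directly on the cells that were missing. The sketch below shows the idea on synthetic data. The frames `original`, `corrupted`, and `imputed` and the helper `masked_rmse` are hypothetical, and column-mean filling is only a stand-in imputer. Applying this to the notebook's data would additionally assume that the rows and columns of `houses_original` line up with `houses_corrupted`, which has not been checked here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical "original" data standing in for the uncorrupted dataset
original = pd.DataFrame({
    "a": rng.normal(10, 2, size=500),
    "b": rng.normal(100, 15, size=500),
})

# Corrupt a copy by hiding 10% of column "b"
corrupted = original.copy()
hidden_rows = rng.choice(original.index, size=50, replace=False)
corrupted.loc[hidden_rows, "b"] = np.nan

# Stand-in imputer (column mean); swap in KNN or MICE output to compare them
imputed = corrupted.fillna(corrupted.mean())

def masked_rmse(truth, filled, mask):
    """RMSE computed only over cells that were missing before imputation."""
    diff = (truth - filled).to_numpy()[mask.to_numpy()]
    return float(np.sqrt(np.mean(diff ** 2)))

mask = corrupted.isna()
rmse = masked_rmse(original, imputed, mask)
print(f"RMSE on imputed cells only: {rmse:.2f}")
```

Restricting the error to the masked cells matters: averaging over all cells would dilute the score with values that were never imputed.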

**Overall Conclusions:**

- **No single "best" method** exists; the choice depends on data characteristics, computational resources, and interpretability needs
- **Statistical methods** (IQR) are valuable for transparency and speed but may miss complex patterns
- **Algorithmic methods** (Isolation Forest, MICE) capture complexity but sacrifice interpretability
- **Combining methods** (as we did) provides both breadth and confidence in results
- **Domain knowledge** is crucial for making informed decisions about outliers and imputation strategies
- **Validation** should always include downstream tasks (like regression) to ensure data quality improvements translate to model improvements

For this housing dataset, MICE likely performed better due to complex feature interdependencies (location, size, price, etc.), while KNN provided more interpretable results. The hybrid approach of flagging outliers without removing them, combined with robust imputation, balanced data quality with information preservation.