# **MACHINE LEARNING EXPERIMENT - HOUSE PRICES PREDICTION**

**Author:** Anwar Rohmadi  
**Dataset:** House Prices - Advanced Regression Techniques (Kaggle)  
**Task:** Regression (House Price Prediction)  
**Date:** 2024

---

# **1. Dataset Introduction**

## 1.1 Dataset Source
This dataset comes from the Kaggle competition **"House Prices - Advanced Regression Techniques"**.

**URL:** https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

## 1.2 Dataset Description
The dataset contains **79 explanatory variables** describing various aspects of residential homes in Ames, Iowa. The target variable is **SalePrice** (the sale price of the house in USD).

## 1.3 Dataset Characteristics
| Aspect | Detail |
|--------|--------|
| Training samples | 1,460 |
| Test samples | 1,459 |
| Number of features | 79 |
| Target | SalePrice (continuous) |
| Task type | Regression |
| Missing values | Present (in several columns) |
| Data types | Numeric + categorical |

# **2. Import Library**

In [None]:
# Data Manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Utilities
import os
import json
import warnings
warnings.filterwarnings('ignore')

# Settings
pd.set_option('display.max_columns', 100)
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

# **3. Loading the Dataset**

In [None]:
# Configuration
RAW_DIR = "../house_prices_raw"
PROCESSED_DIR = "../house_prices_preprocessing"
TARGET_COL = "SalePrice"
ID_COL = "Id"

# Load dataset
train_df = pd.read_csv(f"{RAW_DIR}/train.csv")
test_df = pd.read_csv(f"{RAW_DIR}/test.csv")

print(f"Train dataset shape: {train_df.shape}")
print(f"Test dataset shape: {test_df.shape}")
print(f"\nTarget column: {TARGET_COL}")

In [None]:
# Preview data
train_df.head()

In [None]:
# Dataset info (dtypes and non-null counts)
train_df.info()

# **4. Exploratory Data Analysis (EDA)**

## 4.1 Descriptive Statistics

In [None]:
# Descriptive statistics for numeric columns
train_df.describe()

In [None]:
# Target variable statistics
print("=" * 50)
print("TARGET STATISTICS (SalePrice)")
print("=" * 50)
print(f"Mean: ${train_df[TARGET_COL].mean():,.2f}")
print(f"Median: ${train_df[TARGET_COL].median():,.2f}")
print(f"Std: ${train_df[TARGET_COL].std():,.2f}")
print(f"Min: ${train_df[TARGET_COL].min():,.2f}")
print(f"Max: ${train_df[TARGET_COL].max():,.2f}")
print(f"Skewness: {train_df[TARGET_COL].skew():.4f}")
print(f"Kurtosis: {train_df[TARGET_COL].kurtosis():.4f}")

## 4.2 Target Variable Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(train_df[TARGET_COL], bins=50, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].set_xlabel('SalePrice ($)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of SalePrice')
axes[0].axvline(train_df[TARGET_COL].mean(), color='red', linestyle='--', label=f'Mean: ${train_df[TARGET_COL].mean():,.0f}')
axes[0].axvline(train_df[TARGET_COL].median(), color='green', linestyle='--', label=f'Median: ${train_df[TARGET_COL].median():,.0f}')
axes[0].legend()

# Log-transformed
axes[1].hist(np.log1p(train_df[TARGET_COL]), bins=50, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_xlabel('Log(SalePrice + 1)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Log-Transformed SalePrice')

plt.tight_layout()
plt.savefig('target_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n📊 Insight: The target variable has a right-skewed distribution.")
print("   A log transformation can help bring the distribution closer to normal.")
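
The effect of the log transformation can be checked numerically. The sketch below is a minimal illustration on synthetic log-normal values (not the Kaggle data; the distribution parameters are assumptions chosen only to produce a right-skewed sample). It compares skewness before and after `np.log1p` and shows that `np.expm1` maps values back to the original scale.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (log-normal); NOT the Kaggle data
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

log_prices = np.log1p(prices)      # forward transform: log(1 + x)
recovered = np.expm1(log_prices)   # inverse transform: exp(y) - 1

print(f"Skewness before log1p: {prices.skew():.3f}")
print(f"Skewness after log1p:  {log_prices.skew():.3f}")
```

If a model were trained on the log-transformed target, its predictions would need `np.expm1` applied before being reported in dollars.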

## 4.3 Missing Values Analysis

In [None]:
# Count missing values
missing = train_df.isnull().sum()
missing_pct = (missing / len(train_df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).sort_values('Missing %', ascending=False)

# Keep only columns that have missing values
missing_df = missing_df[missing_df['Missing Count'] > 0]

print(f"Number of columns with missing values: {len(missing_df)}")
print("\nTop 10 columns with the most missing values:")
missing_df.head(10)

In [None]:
# Visualize missing values
plt.figure(figsize=(12, 8))
top_missing = missing_df.head(15)
plt.barh(top_missing.index, top_missing['Missing %'], color='coral')
plt.xlabel('Missing Percentage (%)')
plt.title('Top 15 Features with Missing Values')
plt.gca().invert_yaxis()
for i, v in enumerate(top_missing['Missing %']):
    plt.text(v + 0.5, i, f'{v:.1f}%', va='center')
plt.tight_layout()
plt.savefig('missing_values.png', dpi=150, bbox_inches='tight')
plt.show()

## 4.4 Numeric Feature Correlation

In [None]:
# Compute correlation with the target
numeric_cols = train_df.select_dtypes(include=[np.number]).columns
correlations = train_df[numeric_cols].corr()[TARGET_COL].drop(TARGET_COL).sort_values(ascending=False)

print("Top 10 features most POSITIVELY correlated with SalePrice:")
print(correlations.head(10))
print("\n10 features with the lowest (most NEGATIVE) correlation with SalePrice:")
print(correlations.tail(10))
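
Because `tail(10)` of a descending sort returns the lowest values regardless of sign, it can be useful to split the correlations by sign explicitly. The sketch below shows one way to do this on a small synthetic frame; the column names (`size`, `age`, `noise`, `price`) and coefficients are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: price rises with size and falls with age (made-up relation)
rng = np.random.default_rng(0)
n = 500
size = rng.normal(size=n)
age = rng.normal(size=n)
toy = pd.DataFrame({
    "size": size,
    "age": age,
    "noise": rng.normal(size=n),
    "price": 3.0 * size - 2.0 * age + rng.normal(scale=0.5, size=n),
})

corr = toy.corr()["price"].drop("price")

# Split by sign so each list contains only what its label says
positive = corr[corr > 0].sort_values(ascending=False)
negative = corr[corr < 0].sort_values()

print("Positively correlated:\n", positive)
print("Negatively correlated:\n", negative)
```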

In [None]:
# Correlation heatmap of the top features
top_corr_features = correlations.abs().nlargest(15).index.tolist()
top_corr_features.append(TARGET_COL)

plt.figure(figsize=(12, 10))
corr_matrix = train_df[top_corr_features].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=0.5)
plt.title('Correlation Heatmap - Top 15 Features')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

## 4.5 Categorical Feature Analysis

In [None]:
# Identify categorical columns
cat_cols = train_df.select_dtypes(include=['object']).columns
print(f"Number of categorical features: {len(cat_cols)}")
print(f"\nCategorical features: {list(cat_cols)}")

In [None]:
# Analyze a few important features; OverallQual is stored as a numeric
# 1-10 rating, so the dtype check below plots it in its natural order
important_cats = ['Neighborhood', 'OverallQual', 'ExterQual', 'KitchenQual']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for i, col in enumerate(important_cats):
    if col in train_df.columns:
        if train_df[col].dtype == 'object':
            order = train_df.groupby(col)[TARGET_COL].median().sort_values(ascending=False).index
            sns.boxplot(data=train_df, x=col, y=TARGET_COL, order=order, ax=axes[i])
        else:
            sns.boxplot(data=train_df, x=col, y=TARGET_COL, ax=axes[i])
        axes[i].set_title(f'{col} vs SalePrice')
        axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('categorical_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

## 4.6 EDA Summary

### Key Findings:
1. **Target Variable (SalePrice)**: Right-skewed, range $34,900 - $755,000
2. **Missing Values**: 19 columns have missing values; the highest are PoolQC (99.5%), MiscFeature (96.3%), and Alley (93.8%)
3. **Top Correlated Features**: OverallQual (0.79), GrLivArea (0.71), GarageCars (0.64)
4. **Important Categorical/Ordinal Features**: Neighborhood, OverallQual, and ExterQual are strongly associated with price

# **5. Data Preprocessing**

## 5.1 Handling Missing Values

In [None]:
def handle_missing_values(df):
    """Handle missing values in the dataset."""
    df = df.copy()
    
    # Numeric: fill with the column median
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df[col].isnull().any():
            # Assign back instead of using inplace=True on a column selection,
            # which recent pandas versions treat as chained assignment
            df[col] = df[col].fillna(df[col].median())
    
    # Categorical: fill with the mode, or 'None' if the column has no mode
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        if df[col].isnull().any():
            mode_val = df[col].mode()
            df[col] = df[col].fillna(mode_val[0] if len(mode_val) > 0 else 'None')
    
    return df

# Apply
train_processed = handle_missing_values(train_df)
print(f"Missing values after handling: {train_processed.isnull().sum().sum()}")
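
In this dataset's data description, a missing value in columns such as `PoolQC` or `Alley` means the feature is absent (no pool, no alley access), not that the value is unknown. Mode imputation would assign a pool quality to houses without a pool. The sketch below shows an alternative on a toy frame: columns on a hand-picked list get the literal category `"None"`, and the rest follow the mode/median rule. The list `ABSENT_MEANS_NONE` and the toy values are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame using real column names; the values are made up
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "Alley": [np.nan, np.nan, "Pave"],
    "Electrical": ["SBrkr", np.nan, "SBrkr"],
    "LotFrontage": [65.0, np.nan, 80.0],
})

# Assumption: hand-picked columns where NaN means "feature absent"
ABSENT_MEANS_NONE = ["PoolQC", "Alley"]

filled = toy.copy()
# Absent features become their own category
filled[ABSENT_MEANS_NONE] = filled[ABSENT_MEANS_NONE].fillna("None")
# Truly missing categorical value: fall back to the mode
filled["Electrical"] = filled["Electrical"].fillna(filled["Electrical"].mode()[0])
# Missing numeric value: fall back to the median
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].median())

print(filled)
```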

## 5.2 Feature Engineering

In [None]:
def feature_engineering(df):
    """Create new features"""
    df = df.copy()
    
    # Total square footage
    if all(col in df.columns for col in ['TotalBsmtSF', '1stFlrSF', '2ndFlrSF']):
        df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
    
    # Total bathrooms
    bath_cols = ['FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath']
    if all(col in df.columns for col in bath_cols):
        df['TotalBath'] = df['FullBath'] + 0.5*df['HalfBath'] + df['BsmtFullBath'] + 0.5*df['BsmtHalfBath']
    
    # House age
    if 'YearBuilt' in df.columns and 'YrSold' in df.columns:
        df['HouseAge'] = df['YrSold'] - df['YearBuilt']
    
    # Remodel age
    if 'YearRemodAdd' in df.columns and 'YrSold' in df.columns:
        df['RemodAge'] = df['YrSold'] - df['YearRemodAdd']
    
    return df

# Apply
train_processed = feature_engineering(train_processed)
new_features = ['TotalSF', 'TotalBath', 'HouseAge', 'RemodAge']
print(f"New features created: {[f for f in new_features if f in train_processed.columns]}")
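
As a worked example of the four engineered features, the sketch below applies the same formulas to a single made-up house (all input values are hypothetical).

```python
import pandas as pd

# One hypothetical house; column names match the dataset, values are invented
house = pd.DataFrame({
    "TotalBsmtSF": [800], "1stFlrSF": [1000], "2ndFlrSF": [600],
    "FullBath": [2], "HalfBath": [1], "BsmtFullBath": [1], "BsmtHalfBath": [0],
    "YearBuilt": [1995], "YearRemodAdd": [2005], "YrSold": [2008],
})

# Same formulas as feature_engineering(): half baths count as 0.5
total_sf = house["TotalBsmtSF"] + house["1stFlrSF"] + house["2ndFlrSF"]
total_bath = (house["FullBath"] + 0.5 * house["HalfBath"]
              + house["BsmtFullBath"] + 0.5 * house["BsmtHalfBath"])
house_age = house["YrSold"] - house["YearBuilt"]
remod_age = house["YrSold"] - house["YearRemodAdd"]

print(total_sf.iloc[0], total_bath.iloc[0], house_age.iloc[0], remod_age.iloc[0])
```

For this house: 800 + 1000 + 600 = 2400 square feet, 2 + 0.5 + 1 + 0 = 3.5 bathrooms, a house age of 13 years, and 3 years since the last remodel.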

## 5.3 Categorical Encoding

In [None]:
def encode_categorical(df, label_encoders=None):
    """Encode categorical variables using LabelEncoder"""
    df = df.copy()
    cat_cols = df.select_dtypes(include=['object']).columns
    
    if label_encoders is None:
        label_encoders = {}
        for col in cat_cols:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col].astype(str))
            label_encoders[col] = le
    else:
        for col in cat_cols:
            if col in label_encoders:
                le = label_encoders[col]
                df[col] = df[col].astype(str).apply(
                    lambda x: le.transform([x])[0] if x in le.classes_ else -1
                )
    
    return df, label_encoders

# Separate target
y = train_processed[TARGET_COL].copy()
X = train_processed.drop(columns=[ID_COL, TARGET_COL])

# Apply encoding
X_encoded, encoders = encode_categorical(X)
print(f"Shape after encoding: {X_encoded.shape}")
print(f"All numeric now: {X_encoded.select_dtypes(include=['object']).shape[1] == 0}")
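
The encoding logic above can be illustrated without scikit-learn. `LabelEncoder` assigns integer codes to the sorted unique values, and `encode_categorical` maps categories that were not seen during fitting to -1. The sketch below reproduces both behaviours with a plain dictionary on a toy column; the variable names and values are hypothetical.

```python
import pandas as pd

train_col = pd.Series(["Gd", "TA", "Ex", "TA"])
new_col = pd.Series(["TA", "Fa", "Ex"])   # "Fa" never appears in train_col

# Sorted unique values -> integer codes (the ordering LabelEncoder uses)
mapping = {cat: code for code, cat in enumerate(sorted(train_col.unique()))}

train_codes = train_col.map(mapping)
# Unseen categories have no entry in the mapping; send them to -1
new_codes = new_col.map(mapping).fillna(-1).astype(int)

print(mapping)
print(new_codes.tolist())
```

Note that the codes follow alphabetical order (`Ex` = 0, `Gd` = 1, `TA` = 2), not quality order, so the integers carry no ordinal meaning for quality-type columns.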

## 5.4 Train/Validation Split

In [None]:
# Split data
TEST_SIZE = 0.2
RANDOM_STATE = 42

X_train, X_val, y_train, y_val = train_test_split(
    X_encoded, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
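
The expected split sizes can be verified by hand. The sketch below assumes the validation count is `test_size * n_samples` rounded up, with the remainder going to training.

```python
import math

n_samples = 1460   # rows in train.csv
test_size = 0.2

# Assumption: validation count is rounded up, the remainder is used for training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(f"Training rows: {n_train}, validation rows: {n_val}")
```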

## 5.5 Feature Scaling

In [None]:
# Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), 
    columns=X_train.columns, 
    index=X_train.index
)
X_val_scaled = pd.DataFrame(
    scaler.transform(X_val), 
    columns=X_val.columns, 
    index=X_val.index
)

print("Scaling applied successfully!")
print(f"X_train_scaled mean: {X_train_scaled.mean().mean():.6f}")
print(f"X_train_scaled std: {X_train_scaled.std().mean():.6f}")
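
The printed standard deviation will be slightly above 1 rather than exactly 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below reproduces the scaling by hand on synthetic data (the `area` column and its distribution are assumptions) and shows both values, along with the reuse of training statistics on the validation set.

```python
import numpy as np
import pandas as pd

# Synthetic feature; NOT the Kaggle data
rng = np.random.default_rng(1)
train = pd.DataFrame({"area": rng.normal(1500, 400, size=200)})
val = pd.DataFrame({"area": rng.normal(1500, 400, size=50)})

# Statistics come from the training split only
mu = train.mean()
sigma = train.std(ddof=0)   # population std, as StandardScaler uses

train_scaled = (train - mu) / sigma
val_scaled = (val - mu) / sigma   # validation reuses the training statistics

pop_std = train_scaled["area"].std(ddof=0)
sample_std = train_scaled["area"].std()   # pandas default is ddof=1

print(f"ddof=0 std: {pop_std:.6f}")
print(f"ddof=1 std: {sample_std:.6f}")
```

With n training rows the two differ by a factor of sqrt(n / (n - 1)), which is why the value printed by the notebook is close to, but not exactly, 1.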

## 5.6 Save Preprocessed Data

In [None]:
# Save to CSV (os is already imported in section 2)
os.makedirs(PROCESSED_DIR, exist_ok=True)

X_train_scaled.to_csv(f"{PROCESSED_DIR}/X_train.csv", index=False)
X_val_scaled.to_csv(f"{PROCESSED_DIR}/X_val.csv", index=False)
y_train.to_csv(f"{PROCESSED_DIR}/y_train.csv", index=False)
y_val.to_csv(f"{PROCESSED_DIR}/y_val.csv", index=False)

# Save metadata
metadata = {
    'feature_cols': X_train_scaled.columns.tolist(),
    'n_train': len(X_train_scaled),
    'n_val': len(X_val_scaled),
    'n_features': len(X_train_scaled.columns)
}

with open(f"{PROCESSED_DIR}/metadata.json", 'w') as f:
    json.dump(metadata, f, indent=2)

print("="*50)
print("PREPROCESSING COMPLETE!")
print("="*50)
print(f"Files saved to: {PROCESSED_DIR}/")
print(f"Train samples: {metadata['n_train']}")
print(f"Val samples: {metadata['n_val']}")
print(f"Features: {metadata['n_features']}")
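
Since the files are written with `index=False`, the shuffled row index from the split is discarded, and features and targets are matched by row position when reloaded. The sketch below is a minimal round-trip check in a temporary directory, using a tiny made-up frame; it confirms that the reloaded columns match the metadata and that the row counts agree.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-ins for X_train_scaled / y_train with a shuffled index (made-up values)
X = pd.DataFrame({"TotalSF": [0.5, -1.2], "HouseAge": [1.1, 0.3]}, index=[907, 12])
y = pd.Series([181500, 140000], name="SalePrice", index=[907, 12])

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    X.to_csv(out / "X_train.csv", index=False)
    y.to_csv(out / "y_train.csv", index=False)
    (out / "metadata.json").write_text(json.dumps({"feature_cols": X.columns.tolist()}))

    # Reload everything before the temporary directory is removed
    X_back = pd.read_csv(out / "X_train.csv")
    y_back = pd.read_csv(out / "y_train.csv")["SalePrice"]
    meta = json.loads((out / "metadata.json").read_text())

columns_match = list(X_back.columns) == meta["feature_cols"]
rows_match = len(X_back) == len(y_back)
print(f"Columns match metadata: {columns_match}, row counts match: {rows_match}")
```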

---

# **Preprocessing Summary**

| Step | Description | Result |
|------|-------------|--------|
| 1 | Handle Missing Values | Numeric: median, Categorical: mode |
| 2 | Feature Engineering | 4 new features (TotalSF, TotalBath, HouseAge, RemodAge) |
| 3 | Encoding | Label Encoding for 43 categorical columns |
| 4 | Train/Val Split | 80/20 split (1168/292 samples) |
| 5 | Scaling | StandardScaler (mean=0, std=1) |

The dataset is ready for the modelling stage!

---