# HOUSE PRICE PREDICTION SYSTEM
## Part A - Model Development
### Student: Ogah Victor (22CG031902)
### Algorithm: Random Forest Regressor
### Dataset: House Prices: Advanced Regression Techniques (Kaggle)
---

## Step 1: Install Required Libraries

In [None]:
# Install required packages
!pip install pandas numpy scikit-learn joblib kaggle

## Step 2: Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
import warnings
warnings.filterwarnings('ignore')

print("✓ All libraries imported successfully")

## Step 3: Download Dataset from Kaggle
**NOTE:** In Colab, either upload `kaggle.json` to use the Kaggle API, or download the CSV directly as the next cell does.

In [None]:
# Direct download (no Kaggle API needed)
# pandas can read a CSV straight from a URL, so no extra download libraries are required
url = 'https://raw.githubusercontent.com/datascienceactually/House-Prices-Advanced-Regression-Techniques/master/train.csv'
df = pd.read_csv(url)

print("✓ Dataset downloaded successfully")
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:\n{df.head()}")

## Step 4: Data Preprocessing
### Selected Features (6 out of 9 recommended):
1. **OverallQual** - Overall quality rating (1-10)
2. **GrLivArea** - Above ground living area (sq ft)
3. **TotalBsmtSF** - Total basement area (sq ft)
4. **GarageCars** - Number of cars garage can hold
5. **YearBuilt** - Year house was built
6. **FullBath** - Number of full bathrooms

### Why these 6?
- Strong correlation with SalePrice
- Minimal missing values
- Easy to interpret (important for exam prep)
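
The first two criteria can be checked directly. The sketch below illustrates the idea on a small synthetic DataFrame: the values and the `Noise` column are made up and are not the Kaggle data. On the real dataset, the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "GrLivArea": rng.normal(1500, 400, size=n),
    "Noise": rng.normal(0, 1, size=n),  # hypothetical irrelevant column
})
demo["SalePrice"] = (
    20000 * demo["OverallQual"]
    + 60 * demo["GrLivArea"]
    + rng.normal(0, 10000, size=n)
)

# Knock out a few values to mimic missing data
demo.loc[demo.index[:5], "GrLivArea"] = np.nan

# Criterion 1: correlation of each column with the target
correlations = (
    demo.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print("Correlation with SalePrice:")
print(correlations)

# Criterion 2: fraction of missing values per column
missing_fraction = demo.isnull().mean()
print("\nFraction of missing values:")
print(missing_fraction)
```

Columns with a correlation near zero, or with a large share of missing values, would be weaker candidates under these criteria.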

In [None]:
# STEP 4A: Feature Selection
# Select the 6 features and target variable
selected_features = ['OverallQual', 'GrLivArea', 'TotalBsmtSF', 'GarageCars', 'YearBuilt', 'FullBath']
target = 'SalePrice'

# Create new dataframe with selected features
df_selected = df[selected_features + [target]].copy()

print(f"Selected Features: {selected_features}")
print(f"Target Variable: {target}")
print(f"\nDataset shape after feature selection: {df_selected.shape}")
print(f"\nData types:\n{df_selected.dtypes}")

In [None]:
# STEP 4B: Handle Missing Values
print("Missing values before handling:")
print(df_selected.isnull().sum())

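# Fill missing values with each column's median (all selected columns are numeric)
# Note: these medians are computed on the full dataset, before the train/test split
df_selected = df_selected.fillna(df_selected.median())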

print("\nMissing values after handling:")
print(df_selected.isnull().sum())
print("\n✓ Missing values handled")
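
The cell above computes the medians on the full dataset. A stricter variant, sketched here on made-up data as one possible restructuring rather than the notebook's method, learns the medians from the training rows only and reuses them for the test rows, so nothing from the test set reaches preprocessing. `SimpleImputer` is scikit-learn's imputation transformer; the toy values and the names `toy`, `train_part`, and `test_part` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data with gaps (illustrative values, not the Kaggle dataset)
toy = pd.DataFrame({
    "TotalBsmtSF": [800.0, np.nan, 1200.0, 950.0, 1100.0, np.nan, 700.0, 1000.0],
    "GarageCars": [2.0, 1.0, np.nan, 2.0, 3.0, 1.0, 2.0, np.nan],
})

train_part, test_part = train_test_split(toy, test_size=0.25, random_state=42)

# Learn the medians from the training rows only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train_part),
    columns=toy.columns,
    index=train_part.index,
)

# Reuse the training medians to fill gaps in the test rows
test_filled = pd.DataFrame(
    imputer.transform(test_part),
    columns=toy.columns,
    index=test_part.index,
)

print("Medians learned from training rows:", imputer.statistics_)
print(test_filled)
```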

In [None]:
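# STEP 4C: Inspect Data Types and Summary Statistics
# No feature scaling is applied: Random Forest splits on thresholds, so it only
# depends on the ordering of each feature's values, not on their scale
# Confirm that all selected columns are numeric (no categorical encoding needed)
print("Data Info:")
df_selected.info()  # info() prints directly and returns None
print("\nDescriptive Statistics:")
print(df_selected.describe())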

In [None]:
# STEP 4D: Prepare data for modeling
# Separate features and target
X = df_selected[selected_features]
y = df_selected[target]

# Check for any remaining issues
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget statistics:\n{y.describe()}")

## Step 5: Split Data into Training and Testing Sets

In [None]:
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42     # For reproducibility
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"\n✓ Data split completed")

## Step 6: Train Random Forest Model

In [None]:
# Create and train Random Forest Regressor
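# Random Forest advantages:
# 1. Handles non-linear relationships without manual feature engineering
# 2. Relatively robust to outliers in the input features
# 3. No feature scaling required (splits depend only on the ordering of values)
# 4. Averaging many trees reduces variance compared with a single decision tree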

model = RandomForestRegressor(
    n_estimators=100,      # Number of trees
    max_depth=20,          # Max depth of each tree
    min_samples_split=5,   # Minimum samples to split
    min_samples_leaf=2,    # Minimum samples at leaf
    random_state=42,       # For reproducibility
    n_jobs=-1              # Use all CPU cores
)

# Train the model
print("Training Random Forest model...")
model.fit(X_train, y_train)
print("✓ Model training completed!")

## Step 7: Make Predictions

In [None]:
# Make predictions on training and testing sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print("Sample predictions (first 5):")
for i in range(5):
    print(f"Actual: ${y_test.iloc[i]:,.0f} | Predicted: ${y_test_pred[i]:,.0f}")

## Step 8: Evaluate Model Performance
### Regression Metrics Explained:
- **MAE** (Mean Absolute Error): Average absolute difference between actual and predicted prices. Units: dollars
- **MSE** (Mean Squared Error): Average squared difference. Penalizes large errors more heavily. Units: dollars squared
- **RMSE** (Root Mean Squared Error): Square root of MSE. Same units as the target (dollars)
- **R²** (Coefficient of Determination): Proportion of variance explained. 1 is a perfect fit, 0 matches always predicting the mean, and the value can be negative on test data. Higher is better
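
To make the definitions concrete, the following sketch computes each metric by hand with NumPy on a few made-up prices and compares the results with scikit-learn. It also shows two reference points for R²: always predicting the mean gives 0, and a predictor worse than the mean gives a negative value. All numbers are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices, in dollars
y_true = np.array([200000.0, 150000.0, 320000.0, 275000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 280000.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                   # dollars
mse = np.mean(errors ** 2)                      # dollars squared
rmse = np.sqrt(mse)                             # dollars
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE by hand:  {mae:,.2f} | sklearn: {mean_absolute_error(y_true, y_pred):,.2f}")
print(f"MSE by hand:  {mse:,.2f} | sklearn: {mean_squared_error(y_true, y_pred):,.2f}")
print(f"RMSE by hand: {rmse:,.2f}")
print(f"R² by hand:   {r2:.4f} | sklearn: {r2_score(y_true, y_pred):.4f}")

# Reference point 1: always predicting the mean price gives R² = 0
mean_baseline = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_baseline)

# Reference point 2: a predictor worse than the mean gives a negative R²
# (here the true prices in reversed order stand in for a bad model)
r2_bad = r2_score(y_true, y_true[::-1])

print(f"R² of mean baseline: {r2_mean:.4f}")
print(f"R² of bad predictor: {r2_bad:.4f}")
```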

In [None]:
# Calculate evaluation metrics for TRAINING data
train_mae = mean_absolute_error(y_train, y_train_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(y_train, y_train_pred)

print("="*50)
print("TRAINING SET PERFORMANCE")
print("="*50)
print(f"MAE:  ${train_mae:,.2f}")
print(f"MSE:  {train_mse:,.2f}")
print(f"RMSE: ${train_rmse:,.2f}")
print(f"R²:   {train_r2:.4f}")

In [None]:
# Calculate evaluation metrics for TESTING data
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)

print("\n" + "="*50)
print("TESTING SET PERFORMANCE")
print("="*50)
print(f"MAE:  ${test_mae:,.2f}")
print(f"MSE:  {test_mse:,.2f}")
print(f"RMSE: ${test_rmse:,.2f}")
print(f"R²:   {test_r2:.4f}")

print("\n" + "="*50)
print("MODEL INTERPRETATION (for exam prep)")
print("="*50)
print(f"✓ The model explains {test_r2*100:.2f}% of price variance")
print(f"✓ On average, predictions are off by ${test_mae:,.2f}")
print(f"✓ Typical prediction error (RMSE): ${test_rmse:,.2f}")

## Step 9: Feature Importance Analysis

In [None]:
# Get feature importance scores
feature_importance = pd.DataFrame({
    'Feature': selected_features,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFEATURE IMPORTANCE (share of total impurity reduction per feature):")
print("="*50)
for _, row in feature_importance.iterrows():
    print(f"{row['Feature']:15} : {row['Importance']*100:6.2f}%")

print("\n[For Exam]: Random Forest calculates importance by measuring how much")
print("each feature's splits reduce impurity (for regression, the variance of the")
print("target), averaged over all trees and normalized to sum to 1.")
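
Impurity-based importances are derived from the training data. As a cross-check, one option that is not part of this notebook's pipeline is permutation importance, which measures how much the score on held-out data drops when a feature's values are shuffled. The sketch below uses synthetic data in which only one feature carries signal; the column names `signal` and `noise` and all `*_demo` / `Xd_*` variables are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on "signal" only
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.uniform(0, 10, size=300),
    "noise": rng.uniform(0, 10, size=300),
})
y_demo = 5 * X_demo["signal"] + rng.normal(0, 1, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(Xd_train, yd_train)

# Impurity-based importances (computed during training, normalized to sum to 1)
impurity_importance = demo_model.feature_importances_
print("Impurity-based:", dict(zip(X_demo.columns, impurity_importance.round(4))))

# Permutation importance: mean drop in R² on held-out data when a column is shuffled
perm = permutation_importance(
    demo_model, Xd_test, yd_test, n_repeats=10, random_state=42
)
for name, mean_drop in zip(X_demo.columns, perm.importances_mean):
    print(f"{name:10} : mean R² drop = {mean_drop:.4f}")
```

If both methods rank the features similarly, that is some reassurance that the ranking is not an artifact of how the trees were grown.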

## Step 10: Save Model Using Joblib
### Model Persistence:
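- **joblib**: Built on pickle, but more efficient for objects that hold large NumPy arrays, such as fitted scikit-learn models
- **pickle**: Works too, but can be slower for large arrays
- Both allow model reuse without retraining. Only load model files from sources you trust, since unpickling can execute arbitrary code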

In [None]:
# Save the trained model to disk
model_path = '/content/house_price_model.pkl'  # For Colab
# If running locally, use: model_path = './model/house_price_model.pkl'
# (create the ./model directory first; joblib.dump does not create missing folders)

joblib.dump(model, model_path)
print(f"✓ Model saved successfully at: {model_path}")

## Step 11: Verify Model Can Be Reloaded (Without Retraining)

In [None]:
# Load the model from disk
loaded_model = joblib.load(model_path)
print("✓ Model loaded successfully!")

# Test that loaded model works
test_sample = X_test.iloc[0:1]
prediction = loaded_model.predict(test_sample)[0]
actual = y_test.iloc[0]

print(f"\nVerification test:")
print(f"Actual price: ${actual:,.0f}")
print(f"Predicted price: ${prediction:,.0f}")
print(f"Error: ${abs(actual - prediction):,.0f}")
print("\n✓ Model reloading and prediction working correctly!")
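
When the saved model is later loaded by separate code, the input columns must arrive in the same order used for training. One way to guard against a mismatch, sketched here as an assumption rather than as part of this notebook, is to save the feature list together with the model in a single dictionary. The keys `model` and `features`, the toy training rows, and the temporary file path are all illustrative.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy training data (made-up values) with two of the notebook's features
feature_names = ["OverallQual", "GrLivArea"]
toy_X = pd.DataFrame({
    "OverallQual": [5, 7, 8, 6, 9, 4],
    "GrLivArea": [1200, 1800, 2100, 1500, 2600, 1000],
})
toy_y = [150000, 220000, 280000, 180000, 350000, 120000]

toy_model = RandomForestRegressor(n_estimators=10, random_state=42)
toy_model.fit(toy_X[feature_names], toy_y)

# Save the model and its expected column order together
bundle = {"model": toy_model, "features": feature_names}
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "house_price_bundle.pkl")
    joblib.dump(bundle, bundle_path)
    restored = joblib.load(bundle_path)

# A new input whose columns arrive in a different order
new_house = pd.DataFrame({"GrLivArea": [1700], "OverallQual": [7]})

# Reorder the columns to match training before predicting
restored_pred = restored["model"].predict(new_house[restored["features"]])
original_pred = toy_model.predict(new_house[feature_names])

print(f"Restored model prediction: ${restored_pred[0]:,.0f}")
print(f"Original model prediction: ${original_pred[0]:,.0f}")
```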

## SUMMARY FOR EXAM PREPARATION

### What You Built:
1. **Data Preprocessing**: Handled missing values, selected 6 features
2. **Algorithm**: Random Forest Regressor (100 decision trees)
3. **Training**: Trained on 80% of data, tested on 20%
4. **Evaluation**: Calculated MAE, MSE, RMSE, and R² metrics
5. **Persistence**: Saved model with joblib for reuse

### Key Concepts:
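- **Random Forest**: Ensemble of decision trees trained on bootstrap samples; averaging their predictions reduces overfitting compared with a single tree
- **Train-Test Split**: Evaluates generalization to unseen data
- **Regression Metrics**: Measure prediction error (MAE, MSE, RMSE) and explained variance (R²)
- **Feature Importance**: Shows which features the model relies on most
- **Joblib**: Library used to serialize (save and load) scikit-learn models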

### Model Performance Summary:

In [None]:
print("\n" + "="*60)
print("FINAL MODEL PERFORMANCE SUMMARY")
print("="*60)
print(f"\nTest Set Metrics:")
print(f"  MAE (Mean Absolute Error):      ${test_mae:>12,.2f}")
print(f"  MSE (Mean Squared Error):       {test_mse:>12,.2f}")
print(f"  RMSE (Root Mean Squared Error): ${test_rmse:>12,.2f}")
print(f"  R² (Coefficient of Determination): {test_r2:>6.4f}")
print(f"\nTraining Set Metrics (for comparison):")
print(f"  R² Score: {train_r2:.4f}")
print(f"\n✓ Model is ready for deployment!")
print("="*60)