
In [None]:
# Install required libraries (if needed)
!pip install pandas numpy matplotlib seaborn xgboost
!pip install -U scikit-learn


In [None]:
# --- Imports ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os

from google.colab import drive
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

%matplotlib inline

In [None]:
# Mount Google Drive

drive.mount('/content/drive')

Mounted at /content/drive


# Real Estate Price Prediction in Nigeria

**Objective:**  
Predict property prices in Nigeria from listing features (number of bedrooms, bathrooms, toilets, and parking spaces, plus property type, town, and state) using machine learning models.

---

## Dataset
- **Source:** Nigerian Houses Listings Dataset  
- **Rows:** 24,326  
- **Columns (Features):**  
  - `bedrooms` → Number of bedrooms  
  - `bathrooms` → Number of bathrooms  
  - `toilets` → Number of toilets  
  - `parking_space` → Number of parking spaces  
  - `title` → Property type (e.g., apartment, duplex)  
  - `town` → Town or city  
  - `state` → Nigerian state  
- **Target Column:** `price` → Numeric property price

In [None]:
# =========================
# 1. Load Dataset
# =========================
file_path = '/content/drive/MyDrive/MLDatasets/nigeria_houses_data.csv'
df = pd.read_csv(file_path)

# Create folder for saving charts
os.makedirs("assets/realestate/eda", exist_ok=True)

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:\n", df.head())


Dataset Shape: (24326, 8)

First 5 rows:
    bedrooms  bathrooms  toilets  parking_space                 title     town  \
0       6.0        5.0      5.0            4.0       Detached Duplex  Mabushi   
1       4.0        5.0      5.0            4.0     Terraced Duplexes  Katampe   
2       4.0        5.0      5.0            4.0       Detached Duplex    Lekki   
3       4.0        4.0      5.0            6.0       Detached Duplex     Ajah   
4       4.0        4.0      5.0            2.0  Semi Detached Duplex    Lekki   

   state        price  
0  Abuja  450000000.0  
1  Abuja  800000000.0  
2  Lagos  120000000.0  
3  Lagos   40000000.0  
4  Lagos   75000000.0  


### Data Preprocessing
- Handle missing values by removing or imputing them.
- Encode categorical features (`title`, `town`, `state`) using one-hot encoding.
- Scale numerical features (`bedrooms`, `bathrooms`, `toilets`, `parking_space`) if the chosen model is sensitive to feature scale.
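
The encoding and scaling steps listed above could be combined using the `ColumnTransformer`, `OneHotEncoder`, and `StandardScaler` classes already imported. The following is a minimal sketch, not necessarily the pipeline this notebook ends up using: the toy frame reuses the first four listings printed earlier, and `handle_unknown="ignore"` and `sparse_threshold=0.0` are illustrative choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame copied from the first four listings shown above (assumption:
# the real notebook would pass the full feature frame instead).
toy = pd.DataFrame({
    "bedrooms": [6.0, 4.0, 4.0, 4.0],
    "bathrooms": [5.0, 5.0, 5.0, 4.0],
    "toilets": [5.0, 5.0, 5.0, 5.0],
    "parking_space": [4.0, 4.0, 4.0, 6.0],
    "title": ["Detached Duplex", "Terraced Duplexes",
              "Detached Duplex", "Detached Duplex"],
    "town": ["Mabushi", "Katampe", "Lekki", "Ajah"],
    "state": ["Abuja", "Abuja", "Lagos", "Lagos"],
})

numeric_cols = ["bedrooms", "bathrooms", "toilets", "parking_space"]
categorical_cols = ["title", "town", "state"]

preprocessor = ColumnTransformer(
    transformers=[
        # Standardize numeric columns to zero mean and unit variance.
        ("num", StandardScaler(), numeric_cols),
        # One column per category level; unseen levels encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocessor.fit_transform(toy)
# 4 scaled numeric columns followed by one column per category level
print(X.shape)
```

Tree-based models such as the imported `RandomForestRegressor` are insensitive to feature scaling, so the `num` step matters mainly if a scale-sensitive model is swapped in.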

In [None]:
# =========================
# Check for Missing Values & Duplicates
# =========================
print("\nMissing Values:\n", df.isnull().sum())
print("\nDuplicate Rows:", df.duplicated().sum())


Missing Values:
 bedrooms         0
bathrooms        0
toilets          0
parking_space    0
title            0
town             0
state            0
price            0
dtype: int64

Duplicate Rows: 10438


In [None]:
# Drop duplicate rows
df = df.drop_duplicates()

# Confirm removal
print("After removing duplicates:", df.shape)
print("Remaining duplicate rows:", df.duplicated().sum())

After removing duplicates: (13888, 9)
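
The row count matches how pandas counts duplicates: `duplicated()` flags every repeat after the first occurrence, so `drop_duplicates()` should leave 24,326 - 10,438 = 13,888 rows. A small sketch on a made-up frame (the values are illustrative, not taken from the dataset) shows the same bookkeeping:

```python
import pandas as pd

# Hypothetical listings: rows 1 and 3 repeat row 0, row 4 repeats row 2.
toy = pd.DataFrame({
    "bedrooms": [4, 4, 3, 4, 3],
    "town": ["Lekki", "Lekki", "Ajah", "Lekki", "Ajah"],
    "price": [75_000_000, 75_000_000, 40_000_000, 75_000_000, 40_000_000],
})

# duplicated() marks only the repeats, not the first occurrence of each row.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print("Rows before:", len(toy))
print("Flagged as duplicates:", n_duplicates)
print("Rows after:", len(deduped))
```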


In [None]:
# =========================
# Summary Statistics
# =========================
print("\nSummary Statistics:\n", df.describe())


Summary Statistics:
            bedrooms     bathrooms       toilets  parking_space         price
count  24326.000000  24326.000000  24326.000000   24326.000000  2.432600e+04
mean       4.338814      4.600798      5.176355       4.041725  3.013802e+08
std        1.138497      1.163161      1.226253       1.399936  1.220403e+10
min        1.000000      1.000000      1.000000       1.000000  9.000000e+04
25%        4.000000      4.000000      5.000000       4.000000  5.200000e+07
50%        4.000000      5.000000      5.000000       4.000000  8.500000e+07
75%        5.000000      5.000000      6.000000       4.000000  1.600000e+08
max        9.000000      9.000000      9.000000       9.000000  1.800000e+12


### Exploratory Data Analysis (EDA)
- Visualize distribution of property prices.
- Analyze relationships between features and the target price.
- Identify key features that influence price using correlation analysis.
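
One way to approach the correlation bullet is to correlate each numeric feature with the price. The sketch below is an assumption about how that step might look, using the five rows printed earlier as a stand-in for `df`. It uses Pearson correlation (the pandas default), and five rows are far too few to draw conclusions from.

```python
import pandas as pd

# Stand-in for df: the five listings shown in the df.head() output.
sample = pd.DataFrame({
    "bedrooms": [6.0, 4.0, 4.0, 4.0, 4.0],
    "bathrooms": [5.0, 5.0, 5.0, 4.0, 4.0],
    "toilets": [5.0, 5.0, 5.0, 5.0, 5.0],
    "parking_space": [4.0, 4.0, 4.0, 6.0, 2.0],
    "price": [450_000_000.0, 800_000_000.0, 120_000_000.0,
              40_000_000.0, 75_000_000.0],
})

# Keep numeric columns only, then read off the "price" column of the matrix.
numeric = sample.select_dtypes(include="number")
corr_with_price = (
    numeric.corr()["price"]
    .drop("price")                 # self-correlation is not informative
    .dropna()                      # constant columns (toilets here) give NaN
    .sort_values(ascending=False)
)
print(corr_with_price)
```

On the full dataset the same expression could be applied to `df` directly, optionally against a log-transformed price to reduce the influence of extreme listings.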

In [None]:
# =========================
# Price Distribution
# =========================
plt.figure(figsize=(15,10))
sns.histplot(df["price"], bins=50, kde=True)
plt.title("Price Distribution")
plt.xlabel("Price (₦)")
plt.ylabel("Count")
plt.savefig("assets/realestate/eda/price_distribution.png")
plt.close()

# Log-transformed price (to handle skewness)
df["log_price"] = np.log1p(df["price"])
plt.figure(figsize=(15,10))
sns.histplot(df["log_price"], bins=50, kde=True, color="orange")
plt.title("Log-Transformed Price Distribution")
plt.xlabel("Log Price")
plt.ylabel("Count")
plt.savefig("assets/realestate/eda/log_price_distribution.png")
plt.close()
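
If a model is later trained on `log_price`, its predictions are in log units and have to be mapped back to naira. A minimal sketch of that round trip, assuming `np.expm1` as the inverse of the `np.log1p` call above and reusing the five prices printed earlier:

```python
import numpy as np
import pandas as pd

# The five prices from the df.head() output, in naira.
prices = pd.Series([450_000_000.0, 800_000_000.0, 120_000_000.0,
                    40_000_000.0, 75_000_000.0])

log_prices = np.log1p(prices)     # log(1 + price), same transform as above
recovered = np.expm1(log_prices)  # exp(x) - 1 undoes log1p

# Sample skewness before and after the transform.
print("Skewness of raw prices:", prices.skew())
print("Skewness of log prices:", log_prices.skew())
print("Round trip matches:", np.allclose(recovered, prices))
```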