# Business Problem Understanding

**Context**  
The dataset contains housing information from the 1990 California census. It covers geographic coordinates (`longitude` and `latitude`), housing characteristics (`housing_median_age`, `total_rooms`), demographic information (`population`, `households`, `median_income`), and the prediction target, `median_house_value`.

California's housing market is highly dynamic, with large price differences between coastal, suburban, and inland areas. This makes it hard for buyers, sellers, real estate agents, and investors to determine a fair market value without modern analytical tools. Much like Airbnb hosts who struggle to set an optimal rental price, participants in the California property market need a reliable way to understand home valuations. With the California Housing dataset, we can analyze the factors that influence house prices and build a predictive model that offers useful insight to homeowners, property developers, and investors, especially as market competition intensifies.

**Problem Statement**

One of the main problems in the property industry is the difficulty of predicting house prices accurately from a range of features. Accurate predictions are key to a model that benefits all parties involved: sellers, buyers, and real estate agents. The challenge is sharper in a diverse market such as California, where house prices can vary drastically with location, property condition, and the demographic profile of each area.

The dataset provides a variety of information about each property: location coordinates (`longitude` and `latitude`), housing age (`housing_median_age`), number of rooms (`total_rooms`), population (`population`), number of households (`households`), median income (`median_income`), and proximity to the ocean (`ocean_proximity`). Building an accurate prediction model on these features is crucial but complex. Property professionals and prospective buyers often lack clear guidance on how combinations of these factors, from distance to the city center to demographic trends, affect property values, especially in a tightly competitive market.

As the number of variables influencing house prices grows, accurate valuation in California's competitive market becomes an urgent need. Without a reliable prediction tool, sellers risk pricing too high, so the property takes a long time to sell, or too low, so they lose potential profit, as often happens to homeowners in premium areas such as San Francisco. Buyers, in turn, may struggle to judge whether an asking price matches the property's condition, location, or surrounding amenities, especially given the inflation and rising demand of later years.

# Data Preparation

## Library

In [None]:
%%capture
!pip install jcopml
!pip install category_encoders
!pip install imblearn

In [None]:
import pandas as pd
import numpy as np

from scipy import stats
from scipy.stats import uniform, randint

import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import MarkerCluster

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.utils import resample

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, IsolationForest
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, make_scorer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold

## Load Data

In [None]:
!gdown 1NJ7DsgZ3zIdZWTz17RQWgbtDBuk1JVg3

Downloading...
From: https://drive.google.com/uc?id=1NJ7DsgZ3zIdZWTz17RQWgbtDBuk1JVg3
To: /content/data_california_house.csv


In [None]:
df = pd.read_csv("data_california_house.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-119.79,36.73,52.0,112.0,28.0,193.0,40.0,1.975,INLAND,47500.0
1,-122.21,37.77,43.0,1017.0,328.0,836.0,277.0,2.2604,NEAR BAY,100000.0
2,-118.04,33.87,17.0,2358.0,396.0,1387.0,364.0,6.299,<1H OCEAN,285800.0
3,-118.28,34.06,17.0,2518.0,1196.0,3051.0,1000.0,1.7199,<1H OCEAN,175000.0
4,-119.81,36.73,50.0,772.0,194.0,606.0,167.0,2.2206,INLAND,59200.0


| Feature | Description |
|--------------------|---------------------------------------------------------------------------|
| **longitude** | A measure of how far west a house is; more negative values are farther west. |
| **latitude** | A measure of how far north a house is; higher values are farther north. |
| **housing_median_age** | Median age of a house within a block; lower numbers indicate newer buildings. |
| **total_rooms** | Total number of rooms within a block. |
| **total_bedrooms** | Total number of bedrooms within a block. |
| **population** | Total number of people living within a block. |
| **households** | Total number of households (groups of people living in one housing unit) within a block. |
| **median_income** | Median household income within a block (measured in tens of thousands of US dollars). |
| **ocean_proximity** | Location of the house relative to the ocean/sea. |
| **median_house_value** | Median house value for households within a block (measured in US dollars). |




The values in the `ocean_proximity` column are explained below.

| Value | Explanation |
|--------------------|---------------------------------------------------------------------------|
| **ISLAND** | The house is located on an island. |
| **<1H OCEAN** | The house is less than one hour from the ocean. |
| **NEAR OCEAN** | The house is closer to the ocean than `<1H OCEAN`. |
| **NEAR BAY** | The house is near a bay, a body of water connected to an ocean or lake. |
| **INLAND** | The house is not near the coast; it is located inland. |

In [None]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,14448.0,14448.0,14448.0,14448.0,14311.0,14448.0,14448.0,14448.0,14448.0
mean,-119.566647,35.630093,28.618702,2640.132683,538.260709,1425.157323,499.508929,3.866667,206824.624516
std,2.006587,2.140121,12.596694,2191.612441,423.577544,1149.580157,383.09839,1.891158,115365.476182
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1451.0,295.0,784.0,279.0,2.5706,119600.0
50%,-118.49,34.26,29.0,2125.0,435.0,1165.0,410.0,3.5391,180000.0
75%,-118.0,37.71,37.0,3148.0,647.0,1724.0,604.0,4.7361,263900.0
max,-114.31,41.95,52.0,32627.0,6445.0,35682.0,6082.0,15.0001,500001.0


## Duplicate Check

In [None]:
df.duplicated().sum()  # No duplicate rows.

np.int64(0)

## Missing Value Handling

In [None]:
df.isna().sum()  # Count missing values per column.

Unnamed: 0,0
longitude,0
latitude,0
housing_median_age,0
total_rooms,0
total_bedrooms,137
population,0
households,0
median_income,0
ocean_proximity,0
median_house_value,0


The `total_bedrooms` column has 137 missing values.

In [None]:
# Percentage of rows with a missing total_bedrooms value, relative to all rows
n_missing = int(df['total_bedrooms'].isna().sum())  # 137 missing values
missing_value = n_missing / len(df) * 100           # len(df) is 14448 rows
round(missing_value, 2)

0.95

Only about 0.95% of the rows (under 1 percent) have a missing value, so I decided to drop those rows.
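The missing-value share can also be computed for every column at once instead of by hand. The following is a minimal sketch on a small made-up frame (`demo`), not the California housing file; `missing_share` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, rounded to 2 decimals."""
    return (frame.isna().mean() * 100).round(2)


# Made-up demo data (assumption), standing in for the real dataset
demo = pd.DataFrame({
    "total_rooms": [112.0, 1017.0, 2358.0, 2518.0],
    "total_bedrooms": [28.0, np.nan, 396.0, 1196.0],
})

share = missing_share(demo)
print(share)

# Drop only the rows where total_bedrooms is missing;
# .copy() returns an independent frame for later column assignments.
cleaned = demo.dropna(subset=["total_bedrooms"]).copy()
print(len(demo), "->", len(cleaned))
```

Dropping rows is a reasonable choice when the share is as small as it is here; with a larger share, imputing the column (for example with its median) would usually be worth comparing.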

In [None]:
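df = df.dropna().copy()  # .copy() avoids the SettingWithCopyWarning pandas can raise on the column assignments below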

## Feature Engineering: Add Features

### Ratio Based Features

New columns are derived from the existing ones to support the analysis and the house-price prediction model.

In [None]:
df['rooms_per_household'] = df['total_rooms'] / df['households']  # Average number of rooms per household
df['population_per_household'] = df['population'] / df['households']  # Average number of people per household
df['bedroom_ratio'] = df['total_bedrooms'] / df['total_rooms']  # Share of rooms that are bedrooms
df['income_per_person'] = df['median_income'] / (df['population'] / df['households'])  # Median income divided by household size (rough per-person proxy)
# Coastal indicator: 1 for NEAR OCEAN or NEAR BAY, otherwise 0 (<1H OCEAN and ISLAND are coded 0)
df['is_coastal'] = df['ocean_proximity'].isin(['NEAR OCEAN', 'NEAR BAY']).astype(int)
# House age bands; the labels mean Very New, New, Medium, Old, Very Old
df['house_age_category'] = pd.cut(df['housing_median_age'], bins=[0, 10, 20, 30, 40, 52], labels=['Sangat Baru', 'Baru', 'Sedang', 'Tua', 'Sangat Tua'])


In [None]:
df['housing_median_age'].describe()

Unnamed: 0,housing_median_age
count,14311.0
mean,28.609671
std,12.606493
min,1.0
25%,18.0
50%,29.0
75%,37.0
max,52.0
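The summary above shows ages between 1 and 52, so the bin edges `[0, 10, 20, 30, 40, 52]` cover every row. As a rough illustration of how `pd.cut` assigns those labels, the sketch below uses made-up ages (an assumption, not rows from the dataset). With the default `right=True`, each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges (assumption)
ages = pd.Series([1.0, 10.0, 11.0, 20.0, 30.0, 40.0, 41.0, 52.0])

bins = [0, 10, 20, 30, 40, 52]
labels = ['Sangat Baru', 'Baru', 'Sedang', 'Tua', 'Sangat Tua']

# Default right=True: intervals are (0, 10], (10, 20], ..., (40, 52]
cats = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"age": ages, "category": cats}))

# Values outside the outer edges (including exactly 0) become NaN
outside = pd.cut(pd.Series([0.0, 53.0]), bins=bins, labels=labels)
print(outside.isna().all())
```

An age of exactly 10 therefore lands in 'Sangat Baru' and an age of 52 in 'Sangat Tua'; any future data with ages above 52 would need a wider last bin.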


In [None]:
# Check for missing values again
df.isna().sum()

Unnamed: 0,0
longitude,0
latitude,0
housing_median_age,0
total_rooms,0
total_bedrooms,0
population,0
households,0
median_income,0
ocean_proximity,0
median_house_value,0


In [None]:
# Inspect the column data types of df
df.dtypes

Unnamed: 0,0
longitude,float64
latitude,float64
housing_median_age,float64
total_rooms,float64
total_bedrooms,float64
population,float64
households,float64
median_income,float64
ocean_proximity,object
median_house_value,float64


### Binning (Income Level)

In [None]:
# Bin edges for median_income (in tens of thousands of US dollars)
bins = [0, 3, 7, float('inf')]
labels = [1, 2, 3]  # 1 = low, 2 = medium, 3 = high

# Create the income_level feature; right=False makes each bin left-closed, e.g. [0, 3)
df['income_level'] = pd.cut(df['median_income'], bins=bins, labels=labels, right=False)
df['income_level'] = df['income_level'].astype(int)
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value,rooms_per_household,population_per_household,bedroom_ratio,income_per_person,is_coastal,house_age_category,income_level
0,-119.79,36.73,52.0,112.0,28.0,193.0,40.0,1.9750,INLAND,47500.0,2.800000,4.825000,0.250000,0.409326,0,Sangat Tua,1
1,-122.21,37.77,43.0,1017.0,328.0,836.0,277.0,2.2604,NEAR BAY,100000.0,3.671480,3.018051,0.322517,0.748960,1,Sangat Tua,1
2,-118.04,33.87,17.0,2358.0,396.0,1387.0,364.0,6.2990,<1H OCEAN,285800.0,6.478022,3.810440,0.167939,1.653090,0,Baru,2
3,-118.28,34.06,17.0,2518.0,1196.0,3051.0,1000.0,1.7199,<1H OCEAN,175000.0,2.518000,3.051000,0.474980,0.563717,0,Baru,1
4,-119.81,36.73,50.0,772.0,194.0,606.0,167.0,2.2206,INLAND,59200.0,4.622754,3.628743,0.251295,0.611948,0,Sangat Tua,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14442,-120.06,36.94,19.0,901.0,183.0,700.0,190.0,2.2375,INLAND,64300.0,4.742105,3.684211,0.203108,0.607321,0,Baru,1
14443,-121.26,38.27,20.0,1314.0,229.0,712.0,219.0,4.4125,INLAND,144600.0,6.000000,3.251142,0.174277,1.357216,0,Baru,2
14444,-120.89,37.48,27.0,1118.0,195.0,647.0,209.0,2.9135,INLAND,159400.0,5.349282,3.095694,0.174419,0.941146,0,Sedang,1
14446,-117.93,33.62,34.0,2125.0,498.0,1052.0,468.0,5.6315,<1H OCEAN,484600.0,4.540598,2.247863,0.234353,2.505268,0,Tua,2
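Because the income bins use `right=False`, a value that sits exactly on an edge goes to the higher level. The sketch below checks this on a few made-up incomes (an assumption; only the bin edges and labels come from the cell above).

```python
import pandas as pd

# Made-up incomes placed on and next to the bin edges (assumption)
incomes = pd.Series([0.4999, 2.9999, 3.0, 6.9999, 7.0, 15.0001])

# Same edges and labels as the notebook; right=False gives [0, 3), [3, 7), [7, inf)
levels = pd.cut(
    incomes,
    bins=[0, 3, 7, float("inf")],
    labels=[1, 2, 3],
    right=False,
).astype(int)

print(levels.tolist())  # [1, 1, 2, 2, 3, 3]
print(levels.value_counts().sort_index())
```

Checking `value_counts()` on the real column in the same way is a quick way to see how balanced the three levels are before they are used as a feature.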


## New Data Summary

In [None]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,population_per_household,bedroom_ratio,income_per_person,is_coastal,income_level
count,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0,14311.0
mean,-119.56715,35.631365,28.609671,2640.917686,538.260709,1424.772273,499.48047,3.866774,206793.156942,5.42552,3.04125,0.212895,1.408692,0.242121,1.698693
std,2.006374,2.139589,12.606493,2197.192896,423.577544,1151.795857,383.826005,1.890866,115404.371629,2.294973,6.900639,0.058188,0.751835,0.428382,0.569598
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0,0.846154,0.75,0.1,0.008486,0.0,1.0
25%,-121.8,33.93,18.0,1452.0,295.0,784.0,279.0,2.5694,119400.0,4.453362,2.42651,0.175575,0.869561,0.0,1.0
50%,-118.49,34.26,29.0,2125.0,435.0,1164.0,410.0,3.5403,180000.0,5.230769,2.816092,0.202886,1.296942,0.0,2.0
75%,-118.0,37.715,37.0,3142.0,647.0,1722.0,603.5,4.7361,263750.0,6.047714,3.280652,0.239186,1.780154,0.0,2.0
max,-114.31,41.95,52.0,32627.0,6445.0,35682.0,6082.0,15.0001,500001.0,132.533333,599.714286,1.0,7.462508,1.0,3.0
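The summary above shows maxima far beyond the 75th percentile for the ratio features, for example `population_per_household` at 599.71 against 3.28 and `rooms_per_household` at 132.53 against 6.05. One common way to count such extreme rows is the 1.5 × IQR rule. The sketch below is only an illustration on made-up values; `iqr_outlier_mask` is a hypothetical helper, and the IQR rule is an assumption rather than the method this notebook uses.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up household sizes with one extreme value (assumption)
demo = pd.Series([2.4, 2.8, 3.0, 3.1, 3.3, 599.7], name="population_per_household")

mask = iqr_outlier_mask(demo)
print(demo[mask])                       # rows flagged as outliers
print(int(mask.sum()), "of", len(demo), "values flagged")
```

Whether flagged rows should be removed, capped, or kept depends on the model; tree-based regressors tend to be less sensitive to them than linear ones.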


In [None]:
# Features Description
feature_descriptions = []

# Loop through each column in the DataFrame
for column in df.columns:
    # Calculate the number of null values and their percentage
    null_count = df[column].isna().sum()
    null_percentage = round((null_count / len(df[column])) * 100, 2)

    # Get the number of unique values and a sample of unique values
    unique_count = df[column].nunique()
    unique_sample = list(df[column].drop_duplicates().sample(min(2, unique_count)).values)

    # Append the feature description to the list
    feature_descriptions.append([
        column,
        df[column].dtype,
        null_count,
        null_percentage,
        unique_count,
        unique_sample
    ])

# Create a DataFrame to summarize the features
features_summary = pd.DataFrame(
    feature_descriptions,
    columns=['Data Feature',
             'Data Type',
             'Null Count',
             'Null Percentage (%)',
             'Unique Count',
             'Unique Sample']
)
features_summary

Unnamed: 0,Data Feature,Data Type,Null Count,Null Percentage (%),Unique Count,Unique Sample
0,longitude,float64,0,0.0,806,"[-120.09, -121.3]"
1,latitude,float64,0,0.0,835,"[34.33, 37.89]"
2,housing_median_age,float64,0,0.0,52,"[36.0, 52.0]"
3,total_rooms,float64,0,0.0,5213,"[7962.0, 1433.0]"
4,total_bedrooms,float64,0,0.0,1748,"[520.0, 1849.0]"
5,population,float64,0,0.0,3491,"[2189.0, 7604.0]"
6,households,float64,0,0.0,1646,"[516.0, 2342.0]"
7,median_income,float64,0,0.0,9726,"[4.6765, 4.0526]"
8,ocean_proximity,object,0,0.0,5,"[INLAND, ISLAND]"
9,median_house_value,float64,0,0.0,3540,"[43000.0, 25000.0]"
