# **1. Dataset Introduction**


[**California Housing Prices Dataset**](https://www.kaggle.com/datasets/camnugent/california-housing-prices)

1. longitude: A measure of how far west a house is; values are negative for California, so a more negative value is farther west

2. latitude: A measure of how far north a house is; a higher value is farther north

3. housing_median_age: Median age of a house within a block; a lower number is a newer building

4. total_rooms: Total number of rooms within a block

5. total_bedrooms: Total number of bedrooms within a block

6. population: Total number of people residing within a block

7. households: Total number of households, a group of people residing within a home unit, for a block

8. median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

9. median_house_value: Median house value for households within a block (measured in US Dollars)

10. ocean_proximity: Location of the block relative to the ocean (categorical, e.g. `NEAR BAY`, `INLAND`)

# **2. Import Libraries**

At this stage, you need to import the Python libraries required for data analysis and for building machine learning or deep learning models.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV

# **3. Loading the Dataset**

At this stage, you need to load the dataset into the notebook. If the dataset is in CSV format, you can read it with the pandas library. Check the first few rows to understand the structure and to confirm that the data was loaded correctly.

If the dataset is stored on Google Drive, connect Google Drive to Colab first. Once the dataset has loaded, the next step is to verify that the data is consistent and ready for further analysis.

If the dataset is unstructured, adapt this step to the format used in the Machine Learning Pengembangan or Machine Learning Terapan classes.

In [2]:
df = pd.read_csv("../CaliforniaHousing.csv")
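
One way to confirm that the data was loaded correctly is to check for the ten columns listed in the dataset introduction. The sketch below assumes that this column check is sufficient for a first pass; `load_housing` and `EXPECTED_COLUMNS` are hypothetical names that do not appear in the original notebook, and the Google Drive lines are left as comments because they only run inside Colab.

```python
from pathlib import Path

import pandas as pd

# In Colab, Google Drive is typically mounted first (left commented out here):
# from google.colab import drive
# drive.mount('/content/drive')

# Column names taken from the dataset introduction in section 1.
EXPECTED_COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value", "ocean_proximity",
]


def load_housing(csv_path):
    """Read the CSV and raise if any documented column is missing (hypothetical helper)."""
    df = pd.read_csv(Path(csv_path))
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df
```

Calling `load_housing("../CaliforniaHousing.csv")` would then either return the DataFrame or fail early with a readable message.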

# **4. Exploratory Data Analysis (EDA)**

At this stage, you will perform **Exploratory Data Analysis (EDA)** to understand the characteristics of the dataset.

The goal of EDA is to gain an initial, in-depth understanding of the data and to decide on the next steps in the analysis or modeling.

In [3]:
def summarize_df(df):
    print("First 5 Rows")
    print(df.head(), "\n")

    print("DataFrame Shape (Rows, Columns)")
    print(df.shape, "\n")

    print("Info DataFrame")
    # df.info() prints its report directly and returns None,
    # so it is called on its own instead of being wrapped in print()
    df.info()
    print()

    print("Missing Values per Column")
    print(df.isnull().sum(), "\n")

    print("Percentage of Missing Values per Column")
    print(df.isnull().mean() * 100, "\n")

    print("Number of Duplicate Rows")
    print(df.duplicated().sum(), "\n")

    print("Descriptive Statistics")
    print(df.describe(include='all'), "\n")

summarize_df(df)

First 5 Rows
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY   

DataFrame Shape (Rows, Columns)
(20640, 10) 

Info DataFr
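
Beyond a general summary, a common EDA step for a regression task is to rank the numeric features by their correlation with the target. The following is a minimal sketch under that assumption; `target_correlations` is a hypothetical helper, not part of the original notebook, and Pearson correlation only captures linear relationships.

```python
import pandas as pd


def target_correlations(df, target="median_house_value"):
    """Pearson correlation of each numeric column with the target, strongest first."""
    # Keep numeric columns only, so ocean_proximity (text) is ignored.
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength while keeping the sign of each correlation.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

Running `target_correlations(df)` on the loaded housing data would return a Series indexed by feature name, which can help decide which features deserve a closer look.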

# **5. Data Preprocessing**

At this stage, data preprocessing is an important step for ensuring data quality before the data is used in a machine learning model.

If you are working with text data, the raw data often contains missing values, duplicates, or inconsistent value ranges, all of which can affect model performance. This step therefore aims to clean and prepare the data so that the analysis runs optimally.

The following steps can be applied, but you are **not limited** to them:
1. Removing or Handling Missing Values
2. Removing Duplicate Data
3. Feature Normalization or Standardization
4. Outlier Detection and Handling
5. Encoding Categorical Data
6. Binning (Grouping Data)

Adapt these steps to the characteristics of the data you are using, especially when you work with unstructured data.
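
Two of the listed steps, outlier detection and binning, can be sketched as follows. This is an illustration rather than the notebook's own method: the 1.5 × IQR rule is one common convention among several, the bin edges for `median_income` are illustrative assumptions, and both function names are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def bin_income(series, edges=(0, 1.5, 3.0, 4.5, 6.0, float("inf"))):
    """Group median_income into ordered bins labeled 1..n (edges are illustrative)."""
    labels = list(range(1, len(edges)))
    return pd.cut(series, bins=list(edges), labels=labels)
```

For example, `df[iqr_outlier_mask(df['total_rooms'])]` would list the blocks flagged as outliers, leaving the decision to drop, cap, or keep them to you.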

In [4]:
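def preprocess_housing(df):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Fill missing total_bedrooms values with the column median
    df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())

    # Ratio features derived from the raw block-level counts
    df['rooms_per_household'] = df['total_rooms'] / df['households']
    df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']
    df['population_per_household'] = df['population'] / df['households']

    # One-hot encode ocean_proximity; the first category is dropped as the baseline
    df = pd.get_dummies(df, columns=['ocean_proximity'], drop_first=True)

    # Log-transform the right-skewed count features
    skewed_features = ['total_rooms', 'total_bedrooms', 'population', 'households']
    for feature in skewed_features:
        df[feature] = np.log1p(df[feature])

    # Standardize the numeric features; the target median_house_value is left unscaled.
    # Note: the scaler is fitted on the full dataset, before the train/test split.
    scaler = StandardScaler()
    numerical_features = ['longitude', 'latitude', 'housing_median_age', 'median_income',
                          'rooms_per_household', 'bedrooms_per_room', 'population_per_household'] + skewed_features
    df[numerical_features] = scaler.fit_transform(df[numerical_features])

    return df

df_preprocessed = preprocess_housing(df)
df_preprocessed.head()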

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | rooms_per_household | bedrooms_per_room | population_per_household | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.327835 | 1.052548 | 0.982143 | -1.131133 | -1.642192 | -1.694943 | -1.569395 | 2.344766 | 452600.0 | 0.628559 | -1.029988 | -0.049597 | False | False | True | False |
| 1 | -1.322844 | 1.043185 | -0.607019 | 1.651357 | 1.320043 | 1.030337 | 1.449251 | 2.332238 | 358500.0 | 0.327041 | -0.888897 | -0.092512 | False | False | True | False |
| 2 | -1.332827 | 1.038503 | 1.856182 | -0.45031 | -1.110094 | -1.109604 | -1.104849 | 1.782699 | 352100.0 | 1.15562 | -1.291686 | -0.025843 | False | False | True | False |
| 3 | -1.337818 | 1.038503 | 1.856182 | -0.638257 | -0.817506 | -0.949925 | -0.813343 | 0.932968 | 341300.0 | 0.156966 | -0.449613 | -0.050329 | False | False | True | False |
| 4 | -1.337818 | 1.038503 | 1.856182 | -0.31237 | -0.57614 | -0.933021 | -0.583469 | -0.012881 | 342200.0 | 0.344711 | -0.639087 | -0.085616 | False | False | True | False |
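
The cell above computes the imputation median and fits the scaler on the full dataset, so statistics from rows that later end up in the test set influence the transformation. Tree-based models are largely insensitive to feature scaling, so the effect on the scores reported here is probably small, but it is worth avoiding in general. A minimal sketch of an alternative, assuming scikit-learn's `ColumnTransformer` and `Pipeline` are acceptable in this project (the name `build_preprocessor` is hypothetical, and the log transform and ratio features are omitted for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Impute and scale numeric columns, one-hot encode categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # Categories unseen during fit are encoded as all zeros instead of raising.
    categorical = OneHotEncoder(handle_unknown="ignore")
    return ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)],
        sparse_threshold=0,  # always return a dense array
    )
```

The intended usage is to split first, then call `fit_transform` on the training features and `transform` on the test features, so that medians, means, and standard deviations come from the training rows only.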


In [5]:
# Separate the features from the target
X = df_preprocessed.drop('median_house_value', axis=1)
y = df_preprocessed['median_house_value']

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline random forest with 10 trees
rf = RandomForestRegressor(n_estimators=10, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

# Evaluate on the held-out test set (RMSE is in US Dollars, like the target)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest RMSE:", rmse_rf)
print("Random Forest R2:", r2_rf)

Random Forest RMSE: 53529.36478574028
Random Forest R2: 0.7813359842656625
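
`RandomizedSearchCV` is imported in section 2 but is not used in the cells shown here. The sketch below shows one way it might be applied to the random forest baseline. The search space, the number of iterations, and the name `tune_random_forest` are illustrative assumptions, not the author's settings.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


def tune_random_forest(X, y, n_iter=5, cv=3, random_state=42):
    """Randomized search over a small, illustrative random forest search space."""
    param_distributions = {
        "n_estimators": [10, 50, 100],
        "max_depth": [None, 10, 20],
        "min_samples_leaf": [1, 2, 4],
    }
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        # scikit-learn maximizes scores, so RMSE is reported as a negative value.
        scoring="neg_root_mean_squared_error",
        random_state=random_state,
    )
    search.fit(X, y)
    return search
```

Calling `search = tune_random_forest(X_train, y_train)` would make the cross-validated RMSE available as `-search.best_score_` and the refitted model as `search.best_estimator_`, which can then be evaluated on `X_test` in the same way as the baseline.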
