# Chapter 2: End-to-End Machine Learning Project

In this project, we build a regression model that predicts house prices in California from demographic and geographic features. The dataset comes from the book's official source and is used to demonstrate every stage of a Machine Learning project.

---

In [None]:
import os
import urllib.request
import tarfile
from pathlib import Path

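def unduh_dataset(url, simpan_di):
    """Download housing.tgz from url and extract it into simpan_di."""
    # Create the target folder if it does not exist yet.
    Path(simpan_di).mkdir(parents=True, exist_ok=True)
    path_file = os.path.join(simpan_di, "housing.tgz")
    urllib.request.urlretrieve(url, path_file)
    # Unpack the archive into the same folder as the download.
    with tarfile.open(path_file) as arsip:
        arsip.extractall(path=simpan_di)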

DATASET_URL = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz"
DATASET_FOLDER = os.path.join("data", "california_housing")

unduh_dataset(DATASET_URL, DATASET_FOLDER)
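
The download cell above fetches the archive every time it runs. Below is a minimal sketch of a guard that could skip the download when the archive is already on disk. `needs_download` is a hypothetical helper introduced here for illustration; it is not part of the original notebook.

```python
from pathlib import Path

def needs_download(path_file):
    """Return True when path_file is missing or empty (hypothetical helper)."""
    p = Path(path_file)
    # A zero-byte file is treated as a failed or interrupted download.
    return not (p.is_file() and p.stat().st_size > 0)
```

One possible use is to call `urllib.request.urlretrieve` inside `unduh_dataset` only when `needs_download(path_file)` returns `True`. Checking for a non-empty file is a simple heuristic and does not verify that the archive is complete.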

In [None]:
import os
import pandas as pd

def baca_dataset(folder_path):
    """Load housing.csv from folder_path into a DataFrame."""
    file_csv = os.path.join(folder_path, "housing.csv")
    return pd.read_csv(file_csv)

df_housing = baca_dataset(DATASET_FOLDER)
df_housing.head()

## Data Exploration and Visualization

Before building a model, it is important to understand the structure of the data. We will:
- Look at general information about the dataset
- Check the distribution of values
- Plot histograms and a map of the locations
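
The checks listed above can be tried on a tiny table first to see what each call returns. This is a sketch on made-up data: `df_toy` and its values are invented for illustration and are not rows from the housing dataset.

```python
import pandas as pd

# Toy data, invented for illustration only.
df_toy = pd.DataFrame({
    "median_income": [2.5, 3.1, 8.3, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "INLAND"],
})

# Summary statistics (count, mean, std, min, quartiles, max) for a numeric column.
summary = df_toy["median_income"].describe()

# Absolute counts and relative frequencies for a categorical column.
counts = df_toy["ocean_proximity"].value_counts()
shares = df_toy["ocean_proximity"].value_counts(normalize=True)

print(summary)
print(counts)
print(shares)
```

Passing `normalize=True` to `value_counts` returns proportions instead of counts, which can make an imbalance between categories easier to spot.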


In [None]:
df_housing.info()

In [None]:
df_housing.describe()

In [None]:
df_housing['ocean_proximity'].value_counts()

In [None]:
import matplotlib.pyplot as plt

df_housing.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
df_housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                s=df_housing["population"]/100, label="population", figsize=(10,7),
                c="median_house_value", cmap="jet", colorbar=True)
plt.legend()
plt.show()