
# Chapter 2: End-to-End Machine Learning Project  

---

## 1. Introduction  

This chapter walks through building a Machine Learning project from start to finish using the California housing price dataset. The goal is to build a model that predicts the median house price of a district from census data.  

---

## 2. Main Steps  

1. **Look at the Big Picture**  
   Understand the context and the business objective, namely supporting a property investment system with house price predictions.  

2. **Frame the Problem**  
   - Supervised learning  
   - Regression (because the target is a numeric value)  
   - Batch learning (the data is small enough to fit in memory)  

3. **Select a Performance Measure**  
   The measure used is the **Root Mean Square Error (RMSE)**:  
   $$
   \text{RMSE}(X, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 }
   $$

   with the **Mean Absolute Error (MAE)** as an alternative:  
   $$
   \text{MAE}(X, h) = \frac{1}{m} \sum_{i=1}^{m} | h(x^{(i)}) - y^{(i)} |
   $$

4. **Get the Data**  
   - The data is downloaded from the StatLib repository  
   - The dataset has 20,640 rows with attributes such as: `longitude`, `latitude`, `housing_median_age`, `total_rooms`, `population`, `median_income`, `median_house_value`, `ocean_proximity`  

5. **Discover and Visualize the Data to Gain Insights**  
   - Visualize the geographic distribution of house prices  
   - Analyze correlations (for example, the correlation between `median_income` and house prices)  

6. **Prepare the Data for Machine Learning Algorithms**  
   - Data cleaning (for example, imputing missing values)  
   - Feature scaling: standardization and normalization (min-max scaling)  
   - Pipelines to automate the transformations  

7. **Select and Train a Model**  
   - Try a Linear Regression model  
   - Try a Decision Tree and a Random Forest  

8. **Fine-Tune Your Model**  
   - Grid Search and Randomized Search for hyperparameters  
   - Ensemble methods  

9. **Present and Launch Your Solution**  
   - Evaluate on the test set  
   - Documentation, monitoring, and maintenance  
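
The two performance measures from step 3 can be computed directly from their formulas. The following is a minimal sketch, assuming NumPy is available; the helper names `rmse` and `mae` and the sample values are hypothetical and do not come from the housing dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Hypothetical labels and predictions, chosen only for illustration.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 300.0, 440.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
```

In this toy example the RMSE comes out larger than the MAE because squaring gives the single large error (40) more weight, which is why the RMSE is more sensitive to outliers.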

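The preparation ideas from step 6 (imputation, scaling, and a pipeline that chains them) might look like the sketch below. It assumes scikit-learn is installed and uses a small made-up array rather than the housing data; the median strategy and the name `num_pipeline` are illustrative choices, not something these notes prescribe.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chain the transformations so they are always applied in the same order.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("scaler", StandardScaler()),                   # zero mean, unit variance
])

# Hypothetical numeric features with one missing value (np.nan).
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 50.0],
])

X_prepared = num_pipeline.fit_transform(X)
print(X_prepared)
```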
---

## 3. Example Code for Downloading and Loading the Data  

```python
import os
import tarfile
import urllib.request
import pandas as pd

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

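def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract its contents into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

def load_housing_data(housing_path=HOUSING_PATH):
    """Read the extracted housing.csv into a pandas DataFrame."""
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

# Download the archive, load the CSV, and preview the first five rows.
fetch_housing_data()
housing = load_housing_data()
housing.head()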
