# Data Science Module 4

Back to [Data Science](./saindat2024genap.qmd)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In this session, we will discuss one *machine learning* method: **regression**.

The most commonly used regression method is **linear regression**.

The essence of linear regression is this: given a dataset (one target feature to predict, usually called $y$, plus at least one independent variable), we want to find the line that comes closest to all the points.

"Closest" can be measured by summing the (squares of the) differences between the $y$ value of each point and the $y$ value on the line. (Suppose the line is written as the function $y = P\left(x\right)$. Then the $y$ value on the line for the $i$-th point is $P\left(x_i\right)$.)

The smaller this sum, the closer the line is to the points. The sum is called the ***error***, or more precisely here the **SSE** *(sum of squared errors)*:

$$\text{SSE} = \sum_{i=1}^{n} \left( y_i - P\left(x_i\right) \right)^2$$

The goal of linear regression is therefore to find the line that minimizes the *error*, that is, the line that minimizes the SSE.
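As a small illustration of the SSE formula, the sketch below compares two candidate lines on a made-up five-point dataset. The data and the helper name `sse` are assumptions for illustration only; they are not part of the housing dataset used later.

```python
import numpy as np

# Toy data (made up for illustration): points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

def sse(slope, intercept, x, y):
    """Sum of squared errors between y and the line P(x) = slope * x + intercept."""
    predictions = slope * x + intercept
    return float(np.sum((y - predictions) ** 2))

sse_exact = sse(2.0, 1.0, x, y)  # the line that generated the data
sse_flat = sse(0.0, 5.0, x, y)   # a horizontal line through the mean of y

print(sse_exact)  # 0.0
print(sse_flat)   # 40.0
```

The line with the smaller SSE is the one that is closer to the points, which is exactly the criterion linear regression minimizes.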

Linear regression is generally divided into two types:

- simple linear regression, when there is only one independent variable $x$
- multiple linear regression, when there are several independent variables $x_1, x_2, \dots, x_n$

Many other regression methods are actually built on top of linear regression, for example polynomial regression. The idea is the same: find the function of a given form that best fits the given data, whether for description or for prediction.
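One way to see why polynomial regression can be said to be built on linear regression: fitting a quadratic is an ordinary least-squares problem in the columns $1, x, x^2$. The sketch below is a minimal illustration on made-up data using `np.linalg.lstsq`; the variable names are hypothetical.

```python
import numpy as np

# Toy data generated exactly from y = 1 + 2x + 3x^2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with columns 1, x, x^2: the model is linear in its coefficients
A = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares solution, i.e. the coefficients that minimize the SSE
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [1, 2, 3]
```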

## Import Dataset

Before we start, as usual, we need to *import* the dataset.

For this practicum, we continue with last week's dataset, California Housing Prices, which we have already imputed. If you did not get the chance to save the imputed dataset as a CSV *file*, please *download* `housing_modified.csv` below:

* [Direct link: housing_modified.csv](./housing_modified.csv)

Then *read* it with pandas as usual:

In [2]:
df = pd.read_csv("./housing_modified.csv")

In [3]:
df

| | Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20636 | 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20637 | 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20638 | 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [11]:
# Drop the first column, a leftover copy of the row index stored in the CSV
df = df.drop(df.columns[0], axis=1)
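An extra `Unnamed: 0` column like the one dropped here typically appears when a DataFrame is saved with `to_csv` using the default `index=True` and then read back without `index_col`. The sketch below reproduces this on a tiny made-up DataFrame, writing to an in-memory buffer instead of a file, and shows that saving with `index=False` avoids the extra column. That `housing_modified.csv` got its extra column this way is an assumption.

```python
import io
import pandas as pd

# Tiny made-up DataFrame standing in for the real dataset
small = pd.DataFrame({"longitude": [-122.23, -122.22], "latitude": [37.88, 37.86]})

# Default to_csv writes the index as an unnamed first column
buffer_with_index = io.StringIO()
small.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reread_with_index = pd.read_csv(buffer_with_index)
print(list(reread_with_index.columns))  # ['Unnamed: 0', 'longitude', 'latitude']

# index=False leaves the index out, so no extra column appears on re-read
buffer_no_index = io.StringIO()
small.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
reread_no_index = pd.read_csv(buffer_no_index)
print(list(reread_no_index.columns))  # ['longitude', 'latitude']
```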

In [12]:
df

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


Make sure there are no *missing values* left:

In [13]:
df.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
<1H OCEAN             0
INLAND                0
ISLAND                0
NEAR BAY              0
NEAR OCEAN            0
dtype: int64

For this dataset, the main target feature to predict is the house price, `median_house_value`. We can separate this target feature, say $y$, from the other features, say capital $X$.

In [14]:
X = df.drop(columns=["median_house_value"])
y = df[["median_house_value"]]

In [15]:
X

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [16]:
y

| | median_house_value |
|---|---|
| 0 | 452600.0 |
| 1 | 358500.0 |
| 2 | 352100.0 |
| 3 | 341300.0 |
| 4 | 342200.0 |
| ... | ... |
| 20635 | 78100.0 |
| 20636 | 77100.0 |
| 20637 | 92300.0 |
| 20638 | 84700.0 |


## Train-Test Split

The essence of *machine learning* is building a "model" that can learn from patterns and then produce accurate predictions based on those patterns.

So, to check whether our model is good, we focus on testing **how well the model can predict**.

* On the one hand, a *machine learning* model needs data: the model is formed by "training" on that data, trying to learn the patterns in it.
* On the other hand, to test the model's ability to predict, we also need reference data, so that the model's predictions can be compared with the actual values (that is, the reference data).

The data for "training" is called the ***training data***, and the reference data for testing predictive ability is called the ***test data***.

Naturally, these two sets must be disjoint (share no rows), to avoid what is called *data leakage*. If an exact copy of a *training* row also showed up in the *test* data, the prediction would just be memorization, way too easy :D

Regression is actually not limited to *machine learning*. It is also studied in depth in statistics, to the point that there is a dedicated course on regression (Model Linier / Model Linear, that is, Linear Models).

In the *machine learning* context, linear regression (as a model) looks for the line that minimizes the SSE **using the training data only**, that is, the data meant for building the model. The line that is found (the fitted model) is then tested for its predictive ability using the *test* data.
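The fit-on-training, evaluate-on-test idea can be sketched with NumPy alone. Everything below is an assumption for illustration: the data is synthetic, the split is done by hand with a shuffled array of row positions, and `np.polyfit` with `deg=1` stands in for the regression model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a noisy line y = 3x + 2
x = rng.uniform(0.0, 10.0, size=100)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=100)

# Manual 80-20 split: shuffle the row positions, then cut
positions = rng.permutation(100)
train_pos, test_pos = positions[:80], positions[80:]

# Fit the line (degree-1 least squares) on the training rows only
slope, intercept = np.polyfit(x[train_pos], y[train_pos], deg=1)

def sse(x_part, y_part):
    """SSE of the fitted line on the given rows."""
    return float(np.sum((y_part - (slope * x_part + intercept)) ** 2))

print("fitted line:", slope, intercept)
print("SSE on training data:", sse(x[train_pos], y[train_pos]))
print("SSE on test data:", sse(x[test_pos], y[test_pos]))
```

The test rows play no part in choosing `slope` and `intercept`, so the SSE on them reflects how well the line predicts data it has not seen.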

In the real world, the data we obtain usually comes as a single whole. Yet to use *machine learning*, we need both *training* data and *test* data.

So we can split the whole dataset ourselves into *training* data and *test* data; this is called a ***train-test split***.

Conveniently, scikit-learn provides a function for doing a *train-test split*. Let's try it. Import it first:

In [5]:
from sklearn.model_selection import train_test_split

Usually, the dataset is split into 80% *training* data and 20% *test* data. When calling `train_test_split`, this is written as `test_size=0.2`.

The 80-20 ratio is really just a convention; it is the most common choice, but 70-30 or even 90-10 is also fine.

Why so much more *training* data? So that the model can learn the patterns in the data more thoroughly. But be careful: if the *test* data is too small, we cannot test the model's predictive ability well.

When in doubt, just use the 80-20 ratio. It seems to be a de facto standard, used almost everywhere.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
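To see what a call like this returns, the sketch below runs `train_test_split` on a made-up 10-row dataset (`X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`) and checks the sizes, that the two parts share no rows, and that features and targets stay aligned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-ins for X and y with 10 rows
X_toy = pd.DataFrame({"feature": np.arange(10, dtype=float)})
y_toy = pd.DataFrame({"target": 2.0 * np.arange(10, dtype=float)})

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)

# The original row labels are kept, so the two parts can be checked for overlap
overlap = X_tr.index.intersection(X_te.index)
print(len(overlap))  # 0

# Features and targets stay aligned row by row
print(X_tr.index.equals(y_tr.index))  # True
```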

What is the *random state*?

Naturally, we want the *train-test split* to be done at *random*, that is, not according to any particular pattern, so that whatever patterns are present in the *training* data are also more or less present in the *test* data.

On the other hand, if someone else wants to try the model we built, we also want them to get the same results.

If the *train-test split* were truly *random* every time it is run, the results other people get would likely differ noticeably from ours, even though the model is the same.

Therefore, although we want the *train-test split* to be *random*, we also want it to be *random* in the same way every time. We can arrange this by always setting `random_state` to the same value.

`random_state` is usually set to 42, but that is just a convention. Any value works, as long as it is consistent.
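A quick way to check this behavior is to split the same data several times. The sketch below uses a made-up array of 20 integers; the second seed, 7, is an arbitrary choice for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)

# Same random_state: the same split every time
train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=42)
print(np.array_equal(test_a, test_b))  # True

# A different random_state generally produces a different split
train_c, test_c = train_test_split(data, test_size=0.2, random_state=7)
print(test_a, test_c)
```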

Let's look at the result:

In [18]:
X_train

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14196 | -117.03 | 32.71 | 33.0 | 3126.0 | 627.0 | 2300.0 | 623.0 | 3.2596 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8267 | -118.16 | 33.77 | 49.0 | 3382.0 | 787.0 | 1314.0 | 756.0 | 3.8125 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 17445 | -120.48 | 34.66 | 4.0 | 1897.0 | 331.0 | 915.0 | 336.0 | 4.1563 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 14265 | -117.11 | 32.69 | 36.0 | 1421.0 | 367.0 | 1418.0 | 355.0 | 1.9425 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2271 | -119.80 | 36.78 | 43.0 | 2382.0 | 431.0 | 874.0 | 380.0 | 3.5542 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11284 | -117.96 | 33.78 | 35.0 | 1330.0 | 201.0 | 658.0 | 217.0 | 6.3700 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 11964 | -117.43 | 34.02 | 33.0 | 3084.0 | 570.0 | 1753.0 | 449.0 | 3.0500 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5390 | -118.38 | 34.03 | 36.0 | 2101.0 | 569.0 | 1756.0 | 527.0 | 2.9344 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 860 | -121.96 | 37.58 | 15.0 | 3575.0 | 597.0 | 1777.0 | 559.0 | 5.7192 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [19]:
X_test

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20046 | -119.01 | 36.06 | 25.0 | 1505.0 | 537.870553 | 1392.0 | 359.0 | 1.6812 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3024 | -119.46 | 35.14 | 30.0 | 2943.0 | 537.870553 | 1565.0 | 584.0 | 2.5313 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 15663 | -122.44 | 37.80 | 52.0 | 3830.0 | 537.870553 | 1310.0 | 963.0 | 3.4801 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 20484 | -118.72 | 34.28 | 17.0 | 3051.0 | 537.870553 | 1705.0 | 495.0 | 5.7376 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9814 | -121.93 | 36.62 | 34.0 | 2351.0 | 537.870553 | 1063.0 | 428.0 | 3.7250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15362 | -117.22 | 33.36 | 16.0 | 3165.0 | 482.000000 | 1351.0 | 452.0 | 4.6050 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 16623 | -120.83 | 35.36 | 28.0 | 4323.0 | 886.000000 | 1650.0 | 705.0 | 2.7266 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 18086 | -122.05 | 37.31 | 25.0 | 4111.0 | 538.000000 | 1585.0 | 568.0 | 9.2298 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2144 | -119.76 | 36.77 | 36.0 | 2507.0 | 466.000000 | 1227.0 | 474.0 | 2.7850 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [20]:
y_train

| | median_house_value |
|---|---|
| 14196 | 103000.0 |
| 8267 | 382100.0 |
| 17445 | 172600.0 |
| 14265 | 93400.0 |
| 2271 | 96500.0 |
| ... | ... |
| 11284 | 229200.0 |
| 11964 | 97800.0 |
| 5390 | 222100.0 |
| 860 | 283500.0 |


In [21]:
y_test

| | median_house_value |
|---|---|
| 20046 | 47700.0 |
| 3024 | 45800.0 |
| 15663 | 500001.0 |
| 20484 | 218600.0 |
| 9814 | 278000.0 |
| ... | ... |
| 15362 | 263300.0 |
| 16623 | 266800.0 |
| 18086 | 500001.0 |
| 2144 | 72300.0 |


## Simple Linear Regression